Abstract

Genetic algorithm behavior is described in terms of the construction and evolution 
of the sampling distributions over the space of candidate solutions. This novel per- 
spective is motivated by analysis indicating that the schema theory is inadequate for 
completely and properly explaining genetic algorithm behavior. Based on the proposed 
theory, it is argued that the similarities of candidate solutions should be exploited di- 
rectly, rather than encoding candidate solutions and then exploiting their similarities. 
Proportional selection is characterized as a global search operator, and recombination 
is characterized as the search process that exploits similarities. Sequential algorithms 
and many deletion methods are also analyzed. It is shown that by properly constrain- 
ing the search breadth of recombination operators, convergence of genetic algorithms 
to a global optimum can be ensured. 


1 Introduction 

Genetic algorithms are adaptive systems designed to emulate natural evolution. They were 
first proposed by John Holland in 1975 in his seminal work Adaptation in Natural and 
Artificial Systems (Holland, 1975). De Jong suggests that genetic algorithms should be 
understood from the perspectives of genotypic and phenotypic behavior, as well as their 
performance as global optimizers (De Jong, 1993). This paper contributes to this goal by 
describing genetic algorithm behavior in terms of the sampling distributions they impose on 
the genospace and the phenospace, and how these distributions contribute to or detract from 
the optimization process. 

*This work was supported by a contract from the NASA Space Engineering Center for System Health
Management Technology at the University of Cincinnati. 

While genetic algorithms have been shown to be effective in many problem domains, the
theoretical foundation for describing, explaining, and predicting their behavior is presently 
inadequate. As argued in Section 2, the prevailing theory of genetic algorithm behavior, the 
schema theory, is not a suitable theory for describing genetic algorithm behavior. Accord- 
ingly, the primary objective of this paper is to generalize genetic algorithms and to provide 
an adequate basis for their understanding and analysis (Sections 3 & 4). A second objective 
of this paper is to explore the issues and variations of genetic algorithms permitted by their 
generalization in the context of the proposed explanation of genetic algorithm behavior (Sec- 
tion 5). The final objective of this paper is to determine the conditions under which genetic 
algorithms can be assured to converge to a global optimum (Section 6). Finally, conclusions 
and suggestions for future research are presented (Section 7). 

2 Descriptions and Analyses of Genetic Algorithm Behavior

In this section, descriptions and analyses of genetic algorithm behavior are considered. Natu- 
rally, the most basic description of a genetic algorithm and the fundamental basis of analysis 
is its definition. For the purposes of this paper, the canonical genetic algorithm is defined by 
Procedure 1. In step 3 and throughout the paper, the recombination of parental encodings is 
taken to include the effects of both mutation and crossover. Common recombination opera- 
tors and fitness scaling techniques are described throughout the literature (general coverage 
is provided in (Holland, 1975; Goldberg, 1989a; Davis, 1991)). In subsection 2.1, where the 
schema theory is considered, it is assumed that no fitness scaling is used and that the entire 
population of chromosomes is replaced each generation. 

Procedure 1 The Canonical Genetic Algorithm 

1. Initialize a population of chromosomes (binary strings). 

2. Evaluate each chromosome in the population by applying the objective function to its
corresponding candidate solution. 

3. Create new chromosomes by applying a fitness scaling technique to the chromosome 
evaluations, choosing parent chromosomes according to their relative fitness, and re- 
combining their encodings. 

4. Delete members of the population to make room for the new chromosomes. 

5. Evaluate each new chromosome as in Step 2, and insert it into the population. 

6. If the stopping criterion has been satisfied, then stop and return the chromosome with 
the best observed fitness; otherwise continue with Step 3. 
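
The steps of Procedure 1 can be sketched in a few lines of Python. The sketch is illustrative only: it assumes full generational replacement and no fitness scaling (the assumptions stated for subsection 2.1), one-point crossover, bit-flip mutation, a fixed generation count as the stopping criterion, and a strictly positive objective. The name `canonical_ga` and all of its parameters are hypothetical and do not come from the original text.

```python
import random


def canonical_ga(objective, n_bits, pop_size=20, p_c=0.7, p_m=0.01,
                 generations=50, rng=None):
    """Illustrative sketch of Procedure 1 (assumes objective > 0, n_bits >= 2)."""
    rng = rng or random.Random(0)
    # Step 1: initialize a population of binary chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    # Step 2: evaluate each chromosome.
    fit = [objective(c) for c in pop]
    best_f, best = max(zip(fit, pop), key=lambda pair: pair[0])
    # Step 6 (assumed stopping criterion): a fixed number of generations.
    for _ in range(generations):
        # Step 3: proportional selection, then recombination
        # (one-point crossover followed by bit-flip mutation).
        children = []
        while len(children) < pop_size:
            a, b = rng.choices(pop, weights=fit, k=2)
            child = list(a)
            if rng.random() < p_c:
                cut = rng.randint(1, n_bits - 1)
                child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_m else bit for bit in child]
            children.append(child)
        # Steps 4 and 5: delete the old population, insert and evaluate the new one.
        pop = children
        fit = [objective(c) for c in pop]
        gen_f, gen_best = max(zip(fit, pop), key=lambda pair: pair[0])
        if gen_f > best_f:
            best_f, best = gen_f, gen_best
    # Step 6: return the chromosome with the best observed fitness.
    return best, best_f


if __name__ == "__main__":
    # Hypothetical objective: number of ones, shifted to stay strictly positive.
    print(canonical_ga(lambda c: 1 + sum(c), n_bits=16))
```

Tracking the best chromosome outside the population reflects Step 6, which returns the best observed fitness rather than the best member of the final population.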

While the procedural description is complete and exact, it is not adequate for conveying 
a suitable understanding of genetic algorithm behavior. This description is able to explain 
phenomena arising from the use of a genetic algorithm only at the lowest level of abstrac- 
tion and understanding. Since this description operates at the experimental, practical, or 
phenomenal level, it does not constitute a theory. Consequently, the inadequacies of this 
description have given rise to the schema theory and other analyses of genetic algorithms, 
such as Markov chain analysis. 

In the remainder of this section, the suitability of existing analyses of genetic algorithm
behavior is considered on the basis of the following criteria:

1. The theory should be well grounded in the procedural elements and the generating 
mechanisms of genetic algorithms. These include the processes of selection, recombi- 
nation, fitness evaluation, and population management. 

2. The theory should have explanatory and predictive power. 

3. The theory should be robust with respect to algorithmic variations. 

Furthermore, in consideration of Occam’s razor, the preferred theory is the simplest and
most closely grounded in that which is known (i.e., the procedural elements and generating
mechanisms).

In this paper, an individual string is denoted $A$ or $A_j$, where $j = 1, 2, \ldots, N$, and $N$
is the size of the population $\mathcal{A}(t)$ at time $t$. The objective or fitness function is denoted
$f : \mathcal{A} \rightarrow \Re^1 > 0$. A schema, its order, and its defining length are denoted $H$, $o(H)$, and
$\delta(H)$, respectively. A schema’s order is the number of fixed positions or string elements
common to all members of the schema, and its defining length is the distance between the
schema’s first and last fixed positions.


2.1 The Schema Theory 


According to the schema theory, genetic algorithms work in the space of schemata as opposed 
to the space of strings. Therefore, it is necessary to understand the effects of reproduction 
and the recombination operators on the schemata contained within a population in order 
to understand the behavior of genetic algorithms within the context of the schema theory. 
When proportional selection is used, the probability of selecting $A_{j,t}$, the $j$th individual in
the population at time $t$, as a parent is

$$p_{j,t} = \frac{f(A_{j,t})}{\sum_{i=1}^{N} f(A_{i,t})}, \tag{1}$$

and the target sampling rate of a schema $H$ satisfies

$$E\{m(H, t+1)\} \geq m(H, t)\,\frac{f(H, t)}{\bar{f}(\mathcal{A}(t))}\left[1 - p_c\,\frac{\delta(H)}{l - 1} - o(H)\,p_m\right], \tag{2}$$

where $m(H, t)$ is the number of representatives of $H$ in the population at time $t$ (Grefenstette
& Baker, 1989), $f(H, t)$ is the average fitness of the representatives of $H$ in the present
population, $\bar{f}(\mathcal{A}(t))$ is the average fitness of the present population, $l$ is the string length,
$p_c$ is the crossover probability, and $p_m$ is the mutation probability. Based on (2), it has been concluded that small,
low-order schemata with above-average performance are allocated exponentially increasing 
trials in subsequent generations (Goldberg, 1989a). An important observation in the schema
theory is that each binary string of length $l$ implicitly searches or samples $2^l$ schemata. According to
the theory, this implicitly acquired information is then used for trial allocation to schemata 
and to generate increasingly better strings. It has been argued that implicit parallelism 
leverages the power of genetic algorithms (Goldberg, 1989a), and allows them to avoid the 
obstacles of high dimensionality (Holland, 1975). Equation (2) is often referred to as the 
Schema Theorem or the Fundamental Theorem of Genetic Algorithms (Goldberg, 1989a). 
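
As a worked illustration of the quantities in (2), the following Python sketch computes a schema's order, its defining length, and the right-hand side of the bound for a small population. It assumes schemata are written as strings over `0`, `1`, and `*` (with `*` marking an unfixed position) and that strings have length at least two; the function names are hypothetical and chosen only for this example.

```python
def schema_order(schema):
    """o(H): the number of fixed (non-'*') positions."""
    return sum(ch != "*" for ch in schema)


def defining_length(schema):
    """delta(H): distance between the first and last fixed positions."""
    fixed = [i for i, ch in enumerate(schema) if ch != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


def matches(schema, string):
    """True if the string is a representative of the schema."""
    return all(s == "*" or s == c for s, c in zip(schema, string))


def schema_bound(schema, population, fitness, p_c, p_m):
    """Right-hand side of (2): lower bound on E{m(H, t+1)}."""
    reps = [s for s in population if matches(schema, s)]
    if not reps:
        return 0.0
    m = len(reps)
    f_h = sum(fitness(s) for s in reps) / m                      # f(H, t)
    f_bar = sum(fitness(s) for s in population) / len(population)
    length = len(schema)                                          # string length
    survival = (1 - p_c * defining_length(schema) / (length - 1)
                - schema_order(schema) * p_m)
    return m * (f_h / f_bar) * survival


if __name__ == "__main__":
    pop = ["10000", "11101", "00000", "10100"]
    print(schema_bound("1**0*", pop, lambda s: 1 + s.count("1"), 0.5, 0.0))
```

Note that the computation touches only the strings and their fitness values; the schema enters solely as a way of grouping strings after the fact.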

The schema theory will now be evaluated according to the suitability criteria established 
at the beginning of this section. 

1. The allocation of trials to schemata in a manner consistent with the schema theorem
is certainly well grounded in the procedural elements. However, schema information
is not used in the procedure for trial allocation or any other purpose. Therefore, the 
use of acquired schema information to guide or affect genetic algorithm behavior has 
no tangible basis and is not well grounded (Peck, 1993, §3.2.5). 

2. The schema theory has led to useful, verifiable predictions (e.g., see (Fitzpatrick &
Grefenstette, 1988; Goldberg, Deb & Clark, 1992; Goldberg, Deb & Clark, 1993)).
However, the schema theory is inexact due to the inequality in (2). Furthermore, the 
schema theory and the building block hypothesis are unable to explain how genetic 
algorithms systematically generate improved candidate solutions, since they depend 
on the use of implicitly acquired schema information (Peck, 1993, §3.2.5). 

3. The schema theory, as presented in this paper, is not robust with respect to algorithmic 
variations (Peck, 1993, §3.2.5). Genetic algorithm variants using fitness scaling, rank- 
ing, and/or real (floating point) encodings are difficult, if not impossible, to explain 
within the context of the schema theory. The attempts that have been made require a 
new interpretation of the schema theory or higher-order abstractions (Whitley, 1989; 
Goldberg, 1991a; Goldberg, 1991b). Similar algorithms, such as evolution strategies
and evolutionary programming (Bäck & Schwefel, 1993), are beyond the scope of the
schema theory.

It has also been observed that schema-based analysis of genetic algorithm behavior is greatly 
complicated by the difficulties in associating properties to schemata (Forrest & Mitchell,
1993; Grefenstette & Baker, 1989; Grefenstette, 1991; Grefenstette, 1993; Peck, 1993; Peck &
Dhawan, 1993). Finally, since genetic algorithms do not use schema information, there is no
basis to conclude that genetic algorithms realize advantages from implicit parallelism (Peck, 
1993). 

2.2 Alternative Analyses of Genetic Algorithms 

While the primary basis of genetic algorithm analysis has been the schema theory, other 
types of analysis have been pursued as well. The primary bases of alternative analysis have 
been Markov chain and simulated annealing theory. Most of the analyses in the literature 
have only sought to address specific issues, have made simplifying assumptions, or have not 
been dependent on the distinguishing characteristics of genetic algorithms (De Jong, 1975; 
Goldberg & Segrest, 1987; Rabinovich & Wigderson, 1991; Eiben, Aarts & Hee, 1991; Davis
& Principe, 1991).

The theory presented in (Vose & Liepins, 1991a; Nix & Vose, 1992; Vose, 1993a) represents
the most accurate and complete alternative theory of genetic algorithm behavior
in the literature. In (Vose & Liepins, 1991a), Vose and Liepins present a novel, algebraic
formalization and analysis of a simple genetic algorithm. Using Markov chain analysis, with
the state defined by the composition of an infinite-sized population, the trajectory of the
expected populations is modeled, and the conditions for convergence to the absorbing states
of the transition mapping are derived. In (Nix & Vose, 1992), the formalism of the Vose and
Liepins model is applied to a simple genetic algorithm with a finite population size. It is
concluded that, as the population size increases, the asymptotic behavior of the steady state
distributions may be characterized in terms of the Vose and Liepins model. In (Vose, 1993a),
the two preceding works are further tied together, and the GA-surface is introduced. The
GA-surface, which is composed of the points corresponding to populations, may be used to 
provide a geometric interpretation of genetic search and to explain population trajectories. 

The theory contained in (Vose & Liepins, 1991a; Nix & Vose, 1992; Vose, 1993a) will 
now be interpreted in the context of the criteria established at the beginning of this section: 

1. The construction and operation of the population transition operators is well grounded 
in the procedural elements and generating mechanisms of genetic algorithms. In fact, 
the representations in (Nix & Vose, 1992) and (Vose & Liepins, 1991a) are exact for
finite and infinite populations, respectively. 

2. Since the representations are exact, any phenomena observed of genetic algorithms 
will be explainable within their contexts. As an example, observations of punctuated 
equilibrium are explainable in the context of the infinite population representation. 
Furthermore, many predictions regarding short and long term behavior have been 
derived from this analysis. 

3. Markov chain representations may be generated for nearly any algorithmic variant. 
Derived properties must naturally be proved for each variant. 

The above analysis suggests that a suitable theory for genetic algorithm analysis has 
been constructed. There is, however, a subtle caveat to this conclusion: the explanatory 
power of this work is hampered by lumping genetic algorithm behavior into a population 
transition operator. There are many low-level phenomena of genetic algorithms that are not 
adequately understood, and a high-level, unitary abstraction such as a population transition 
operator may have difficulty explaining them. A level of abstraction operating between 
the low-level abstraction of the procedure and the high-level abstraction of the transition 
operator is desired. 

3 Global Random Search Methods: An Overview


This section reviews the theory of global random search methods. This theory serves as the 
basis for an alternative theory of genetic algorithm behavior, which is presented in Section 4. 
The presentation throughout this section primarily summarizes and clarifies the analysis and 
results presented by Zhigljavsky (Zhigljavsky, 1991). A more thorough summary of these 
results is presented in (Peck, 1993). 

This section begins with an introduction to global search methods. This is followed by 
a presentation of basic global random search methods. Finally, generational methods and 
their convergence properties are examined. 

3.1 Introduction and Notation 

In the typical global optimization problem, it is desired to optimize an objective function, 
which may be a mathematical expression or the output of an algorithm, process, experiment, 
or system. Let $X$ denote a set referred to as the feasible region and $f : X \rightarrow \Re^1$ be the
objective function. In the global minimization problem, it is desired to approximate either
the value

$$f^* = \inf_{x \in X} f(x), \tag{3}$$

the point $x^* \in X$ at which the minimal value $f^*$ is attained,

$$x^* = \arg\min_{x \in X} f(x), \tag{4}$$

or both. The global minimizer, $x^*$, is not generally unique.

Approximating $f^*$ and a point $x^* = \arg\min f$ is usually interpreted as finding a point in
either the set

$$A(\delta) = \{x \in X : |f(x) - f(x^*)| \leq \delta\}, \tag{5}$$

or the set

$$B(\varepsilon) = B(x^*, \varepsilon, \rho) = \{x \in X : \rho(x, x^*) \leq \varepsilon\}, \tag{6}$$

where $\rho$ is the given metric on $X$, and $\delta$ and $\varepsilon$ determine the accuracy of the approximation with
respect to the function and argument values (Zhigljavsky, 1991, pg. 2).

In the global maximization problem, alternatively, the objective is to approximate either
the value

$$M = \sup_{x \in X} f(x), \tag{7}$$

the global maximizer, which will also be denoted $x^*$, where

$$x^* = \arg\max_{x \in X} f(x), \tag{8}$$

or both. The meaning of $x^*$ will be understood through context. It should also be noted that
by substituting $-f$ for $f$, the maximization problem may be converted into a minimization
problem, and vice versa. To avoid redundancy, only the minimization problem will be
addressed for the remainder of this and the next subsection.

Generally, a global minimization method is a procedure for constructing a sequence $\{x_k\}$
of points in $X$ that converges to a point at which the global minimum, $f^*$, is attained
or approximated (Zhigljavsky, 1991, pg. 1). The nature of convergence depends on the
optimization method. For example, convergence may be of the values of $f(x_k)$ to $f^*$ or of
the sequence $\{x_k\}$ to a probability measure concentrated at $x^*$. This procedure may use a
priori information about $X$ or $f$, such as values of $f$, its derivatives, or the presence and
nature of random noise.

The complexity of the optimization problem is dependent on the properties of $X$ and $f$.
Furthermore, there exists a duality between the corresponding properties (Zhigljavsky, 1991,
pg. 2). Specifically, if $X$ is complex but $f$ is simple, then the optimization problem may be
reformulated such that $X$ is simple and $f$ is complex, and vice versa.

As stated above, the nature of $X$ affects the complexity of the optimization problem and
should be considered in the selection of the optimization technique. In general, unlike local
optimization, global optimization cannot be done if $X$ is not bounded. Some techniques
require that $X$ possess certain properties (e.g., that $X$ be closed, compact, connected, etc.).
Other important considerations include the choice of a metric on $X$, techniques for reducing
the complexities associated with problem constraints, and the dimension $n$ of $X$ when
$X \subset \Re^n$ (Zhigljavsky, 1991, pg. 3).

The optimization method is typically selected, in part, based on the functional class, $\mathcal{F}$,
of $f$, which is determined by prior knowledge of $f$. The chosen functional class corresponds
to a model of $f$. The wider the functional class $\mathcal{F}$ is, the wider the class of allowable problems
is, and the less efficient the algorithms are (Zhigljavsky, 1991, pg. 3).

3.2 Basic Global Random Search Methods 

Global random search methods may be classified as passive or adaptive. Passive methods, 
such as uniform random sampling (pure random search), proceed without exploiting information
learned about $f$ on $X$. Consequently, these methods are typically quite simple, but
they are also quite inefficient. Adaptive methods, conversely, use acquired and a priori infor- 
mation to improve their efficiency. For a brief survey of adaptive methods, see (Zhigljavsky, 
1991, pg. 82). 

3.2.1 Formalization of Global Random Search Methods 

The following procedure represents a generalization and formalization of global random 
search methods. It is intended to serve as the basis of comparison and discussion of the 
various methods considered in this paper. 

Procedure 2 Formal Scheme of Global Random Search (Zhigljavsky, 1991, Algorithm 3.1.5, 
pg. 85) 

1. Set $k = 1$ and choose a probability distribution $P_1$ on $X$.

2. Sample the distribution $P_k$ $N_k$ times to obtain the points $x_1^{(k)}, \ldots, x_{N_k}^{(k)}$.
At each of these points, evaluate $f$, possibly with random noise.

3. Using a fixed, algorithm-dependent rule, construct the probability distribution $P_{k+1}$
on $X$.

4. If the stopping criterion is satisfied, then stop; otherwise, set k = k + 1 and continue 
with Step 2. 

This procedure illustrates that any global random search method is iterative. Furthermore, 
at each iteration a suitably constructed distribution is sampled (Zhigljavsky, 1991, pg. 85). 
In Markovian methods, $N_k = 1$ for all $k$.
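
The iteration structure of Procedure 2 can be sketched as a short Python loop. This is a minimal sketch under stated assumptions: a one-dimensional feasible region, noise-free evaluations, a constant number of samples per iteration, and a fixed iteration count as the stopping criterion. The names `global_random_search`, `uniform_on_interval`, and `keep_uniform` are hypothetical; the `update` callback stands in for the "fixed, algorithm-dependent rule" of Step 3.

```python
import random


def global_random_search(f, sample_p1, update, n_samples, n_iters, rng):
    """Illustrative sketch of Procedure 2 for minimization.

    sample_p1(rng) draws one point from P_1; update(points, values)
    returns a sampler for the next distribution P_{k+1}.
    """
    sampler = sample_p1                      # Step 1: choose P_1
    best_x, best_f = None, float("inf")
    for _ in range(n_iters):                 # Step 4: fixed iteration budget
        # Step 2: sample the current distribution and evaluate f.
        points = [sampler(rng) for _ in range(n_samples)]
        values = [f(x) for x in points]
        for x, v in zip(points, values):
            if v < best_f:
                best_x, best_f = x, v
        # Step 3: construct the next sampling distribution.
        sampler = update(points, values)
    return best_x, best_f


def uniform_on_interval(rng, lo=-5.0, hi=5.0):
    """P_1: uniform distribution on the assumed feasible region [lo, hi]."""
    return rng.uniform(lo, hi)


def keep_uniform(points, values):
    """A passive rule: ignore what was observed and keep sampling uniformly."""
    return uniform_on_interval


if __name__ == "__main__":
    rng = random.Random(1)
    print(global_random_search(lambda x: (x - 1.0) ** 2,
                               uniform_on_interval, keep_uniform, 10, 50, rng))
```

With `keep_uniform` the loop reduces to pure random search; an adaptive method differs only in the sampler that `update` returns.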

The distributions $\{P_{k+1}\}$ determine how a priori information and the information acquired
during the search process are derived and exploited by the search algorithm. Without
loss of generality, the distributions may be written in the form

$$P_{k+1}(dx) = \int_X R_k(dz)\, Q_k(z, dx), \tag{9}$$

where $R_k$ is a probability distribution on $X$ and $Q_k(z, \cdot)$ is a Markovian transition probability
(Zhigljavsky, 1991, pg. 85). The transition probability, $Q_k(z, \cdot)$, is a measurable,
nonnegative function with respect to the first argument and a probability measure with respect
to the second. Sampling this distribution is performed by sampling $R_k(dz)$ to obtain $z$,
then sampling $Q_k(z, dx)$ to obtain $x$, the desired sample. As shown below, $R_k$ and $Q_k(z, \cdot)$
serve two distinct roles in the search strategy.

The distribution $R_k$ comprises the global aspects of the search strategy. Accordingly,
$R_k$ is constructed using globally derived information about $f$, and a point from all of $X$ is
chosen when sampling $R_k$. The method for constructing $R_k$ largely determines the general
structure of the algorithm, and it is the typical basis for algorithm classification. Common
classes of algorithms include Markovian, generational, and branch and bound.

The distribution $Q_k(z, \cdot)$ comprises the local aspects of the search strategy. When sampling
$Q_k(z, \cdot)$, a point in the neighborhood of $z$ is selected. The term neighborhood should be
interpreted to mean “with large probability near enough” (Zhigljavsky, 1991, pg. 86). The
nature of $Q_k(z, \cdot)$ largely determines the tradeoff between the accuracy of the final result
and the efficiency of the search. A simple choice of $Q_k(z, \cdot)$ is

$$Q_k(z, dx) = \frac{\varphi_k(x - z)\, dx}{\int_X \varphi_k(y - z)\, dy}, \tag{10}$$

where $\varphi_k$ is a chosen distribution density in $\Re^n$. The denominator of (10) is a normalization
constant. A random realization $x_k$ in $X$ from the distribution in (10) may be obtained by
repeatedly sampling $\varphi_k$ to obtain a realization $\xi_k$ until $z + \xi_k \in X$, then setting $x_k = z + \xi_k$.
The distribution described above is the method of choice when random noise is present in
the evaluations of $f$ (Zhigljavsky, 1991, pg. 86). It is also useful as a component of other
distributions.
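
The repeated-sampling recipe for (10) amounts to rejection sampling, which the following Python sketch illustrates. It assumes a one-dimensional feasible region given by a membership test and a Gaussian step density; the cap `max_tries` is an added safeguard that is not part of the original description, and the function name `sample_local` is hypothetical.

```python
import random


def sample_local(z, sample_step, in_region, rng, max_tries=10_000):
    """Draw one point from (10) by rejection.

    sample_step(rng) draws xi from the density phi_k; proposals z + xi
    are redrawn until one lies in the feasible region X.
    """
    for _ in range(max_tries):
        x = z + sample_step(rng)
        if in_region(x):
            return x
    # Assumed safeguard: give up instead of looping forever.
    raise RuntimeError("no feasible proposal found; phi_k may be too wide for X")


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed example: X = [0, 1], phi_k Gaussian with standard deviation 0.2.
    print(sample_local(0.95,
                       lambda r: r.gauss(0.0, 0.2),
                       lambda p: 0.0 <= p <= 1.0,
                       rng))
```

Discarding infeasible proposals is what produces the normalization constant in the denominator of (10): accepted points follow the step density restricted to the feasible region.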

When $f$ is evaluated without noise, the following distributions for $Q_k(z, \cdot)$ are often
preferred:

$$Q_k(z, A) = \int_X 1_{[x \in A,\, f(x) \leq f(z)]}\, T_k(z, dx) + 1_A(z) \int_X 1_{[f(x) > f(z)]}\, T_k(z, dx), \tag{11}$$

where $T_k(z, dx)$ is a Markovian transition probability of the form expressed in (10) and $1_A$
is the indicator of set $A$:

$$1_A(x) = 1_{[x \in A]} = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases} \tag{12}$$

The first integral represents the probability of sampling a point $x \in A$ for which $f(x) \leq f(z)$.
The second integral, which only contributes to the sum if $z \in A$, is the probability of sampling
a point $x \in X$ for which $f(x) > f(z)$. A realization $x_k$ from (11) may be obtained by sampling
the distribution $T_k(z, \cdot)$ to get $\xi_k$ and setting

$$x_k = \begin{cases} \xi_k & \text{if } f(\xi_k) \leq f(z), \\ z & \text{otherwise.} \end{cases}$$

Other methods for constructing $Q_k(z, \cdot)$ exist. In fact, it is not necessary to know the
analytical form of $Q_k(z, \cdot)$; it is only necessary that a method for sampling, such as an
algorithm, exists (Zhigljavsky, 1991, pg. 87). Furthermore, $Q_k(z, \cdot)$ may be constructed
using a priori information or information acquired during the search.
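
The sampling rule for (11) keeps a proposal only if it is no worse than the current point. The Python sketch below illustrates this for minimization under assumed choices: a one-dimensional problem, a Gaussian proposal standing in for the transition probability of the form (10), and hypothetical function names `sample_elitist` and `descend`.

```python
import random


def sample_elitist(z, f, sample_t, rng):
    """One draw from (11): accept the proposal xi only if f(xi) <= f(z)."""
    xi = sample_t(z, rng)
    return xi if f(xi) <= f(z) else z


def descend(z0, f, sample_t, steps, rng):
    """Apply the elitist transition repeatedly and record the visited points."""
    trace = [z0]
    for _ in range(steps):
        trace.append(sample_elitist(trace[-1], f, sample_t, rng))
    return trace


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed proposal: Gaussian step of standard deviation 0.5 around z.
    proposal = lambda z, r: z + r.gauss(0.0, 0.5)
    print(descend(4.0, lambda x: x * x, proposal, 20, rng)[-1])
```

Because a rejected proposal leaves the point unchanged, the objective values along the trace never increase, which is only meaningful when evaluations are noise-free, as the text assumes.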

3.2.2 General Convergence Results

In this section, Zhigljavsky’s general results on the convergence of global random search 
methods will be presented without proof. For the proofs, the interested reader should refer 
to (Zhigljavsky, 1991, §3.2). Without loss of generality, it will be assumed that $N_k = 1$ for
all $k = 1, 2, \ldots$ such that a separate distribution $P_k$ is constructed for each sampled point,
$x_k = x_1^{(k)}$.

Theorem 1 Let $f$ be continuous in the vicinity of a global minimizer $x^*$ of $f$, and assume
that

$$\sum_{k=1}^{\infty} q_k = \infty \tag{13}$$

for any $x \in X$ and $\varepsilon > 0$, where

$$q_k = q_k(x^*, \varepsilon) = \operatorname{vrai\,inf} P_k(B(\varepsilon)), \qquad \Xi_{k-1} = \{x_1, \ldots, x_{k-1}\},$$

and $\operatorname{vrai\,inf} \eta$ is the essential infimum of a random variable $\eta$:

$$\operatorname{vrai\,inf} \eta = \sup\{a : \Pr\{\eta \geq a\} = 1\}.$$

Then for any $\delta > 0$ the sequence of random vectors $x_k$ generated by Procedure 2 with $N_k = 1$
for $k = 1, 2, \ldots$ falls infinitely often into the set $A(\delta)$ with probability one.

Theorem 1 makes use of the probabilities, for each iteration, of falling into an arbitrarily
small set around a global optimizer. It shows that if the sum of these probabilities is
unbounded, then infinitely many evaluations of $f$ will be arbitrarily close to the global
optimum. This theorem applies even when $f$ is evaluated with random noise. Since the
location of any global optimizer is typically not known a priori, it is sufficient instead to
require that Theorem 1 apply to every $x \in X$, in addition to sets around global optimizers.
This stricter, yet simpler, requirement may be expressed:

$$\sum_{k=1}^{\infty} \operatorname*{vrai\,inf}_{\Xi_{k-1}} P_k(B(x, \varepsilon)) = \infty, \tag{14}$$

for all $\varepsilon > 0$, $x \in X$.

There are many ways of selecting probability distributions $P_k$ such that (14) is satisfied.
A common approach is to select the probability distributions $P_k$ according to

$$P_k = \alpha_k P_X + (1 - \alpha_k) G_k, \tag{15}$$

where $0 \leq \alpha_k \leq 1$, $P_X$ is the uniform distribution on $X$, and $G_k$ is an arbitrary distribution
on $X$. A realization, $x_k$, from (15) may be obtained by sampling $P_X$ with probability $\alpha_k$ and
$G_k$ with probability $1 - \alpha_k$. To satisfy (14), it is sufficient to require

$$\sum_{k=1}^{\infty} \alpha_k = \infty.$$
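
Sampling the mixture (15) and checking the divergence condition can be illustrated with a short Python sketch. The two schedules compared, $\alpha_k = 1/k$ and $\alpha_k = 1/k^2$, are assumed examples chosen here (the harmonic series diverges, while the second series converges), and the function names are hypothetical.

```python
import random


def sample_mixture(alpha_k, sample_uniform, sample_g, rng):
    """Draw from (15): P_X with probability alpha_k, otherwise G_k."""
    if rng.random() < alpha_k:
        return sample_uniform(rng)
    return sample_g(rng)


def partial_sum(alpha, n):
    """Sum of alpha(k) for k = 1, ..., n."""
    return sum(alpha(k) for k in range(1, n + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed example: X = [0, 1] and G_k concentrated near 0.5.
    x = sample_mixture(0.1, lambda r: r.uniform(0.0, 1.0),
                       lambda r: min(1.0, max(0.0, r.gauss(0.5, 0.01))), rng)
    print(x)
    # alpha_k = 1/k: partial sums keep growing, so the condition holds.
    print(partial_sum(lambda k: 1.0 / k, 10_000))
    # alpha_k = 1/k**2: partial sums stay bounded, so the condition fails.
    print(partial_sum(lambda k: 1.0 / k ** 2, 10_000))
```

The comparison shows that the weight on uniform sampling may shrink to zero, provided it does not shrink too quickly.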

3.3 Methods of Generations 

Generational methods, also called methods of generations in the literature, sequentially
sample each of a series of probability distributions multiple times; these distributions are
asymptotically concentrated in the vicinity of a global optimizer. Each such multiple sampling
is referred to as a generation. These methods, which were first proposed in the late 1960s,
are based upon the following three heuristics (Zhigljavsky, 1991, pg. 186):

i. New samples of $f$ should most often be obtained in the vicinity of previous,
high-performance samples.

ii. The number of new samples in the vicinity of a previous sample must depend on
the observed value of $f$ at that sample.

iii. The breadth of the sampling distribution around the previous samplings should
decrease as the global optimizer is approached.

Generational methods have many desirable properties. In exchange for their inefficiency 
at solving easy global optimization problems, they are suitable for a wide range of prob- 
lem domains. In particular, they may be applied to very complex problems and they are 
applicable when noise is present. Finally, as shown in Subsection 3.3.2, they have provable
convergence properties. 

In this section, it will be assumed that the feasible region, $X$, is a compact metric space of
an arbitrary type. Furthermore, it will be assumed that the maximization problem is being
considered.

3.3.1 Presentation of Generational Methods 

The following procedure satisfies the three heuristics. It is based on the supposition that
the result of evaluating $f$ at a sample point $x \in X$ and iteration $k$ is a nonnegative random
variable $y_k(x) = f(x) + \xi_k(x)$, where $\xi_k(x)$ is also a random variable. $\mathcal{B}$ is the $\sigma$-algebra of
the Borel subsets of $X$.

Procedure 3 Generalized Method of Generations Algorithm with Randomization 

1. Choose a distribution $P_1$ on $(X, \mathcal{B})$ and set $k = 1$.

2. Sample the distribution $P_k$ $N_k$ times to obtain the points $x_1^{(k)}, \ldots, x_{N_k}^{(k)}$.

3. Evaluate the random variables $y_k(x_j^{(k)})$ at the points $x_j^{(k)}$, where $y_k(x) = f_k(x) + \xi_k(x) \geq 0$
with probability one, and $f_k$ is an auxiliary nonnegative function constructed using
the observed values of $f$ at the points $x_j^{(i)}$ for $j = 1, \ldots, N_i$, $i = 1, \ldots, k$. If

$$\sum_{j=1}^{N_k} y_k(x_j^{(k)}) = 0,$$

then repeat the sampling by returning to Step 2.

4. Construct the next distribution according to

$$P_{k+1}(dx) = \sum_{j=1}^{N_k} p_j^{(k)}\, Q_k(x_j^{(k)}, dx), \tag{16}$$

where

$$p_j^{(k)} = \frac{y_k(x_j^{(k)})}{\sum_{i=1}^{N_k} y_k(x_i^{(k)})}. \tag{17}$$

5. If the stopping criterion is satisfied, then stop; otherwise, substitute $k + 1$ for $k$ and go
to Step 2.


The distribution $P_{k+1}$ in (16) is sampled using superposition: first the discrete distribution

$$\begin{pmatrix} x_1^{(k)} & \cdots & x_{N_k}^{(k)} \\ p_1^{(k)} & \cdots & p_{N_k}^{(k)} \end{pmatrix}$$

is sampled, then the distribution $Q_k(x_j^{(k)}, \cdot)$ is sampled for each realization $x_j^{(k)}$ (Zhigljavsky,
1991, pg. 188). It will be assumed in the theoretical analysis of Procedure 3 that (16) will
be sampled in this manner. In practice, however, variance reduction techniques are typically
applied to the sampling procedure (Zhigljavsky, 1991, pp. 188-189). These techniques ensure
that some of the best points are sampled with probability one.
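
Procedure 3 and the superposition sampling described above can be sketched together in Python. This is a minimal sketch under stated assumptions: a one-dimensional maximization problem, noise-free evaluations with the auxiliary function taken to be the objective itself, a constant number of samples per generation, a fixed generation count as the stopping criterion, and no variance reduction. The names `method_of_generations` and `shrinking_gaussian` are hypothetical.

```python
import math
import random


def method_of_generations(f, sample_p1, local_step, n_samples, n_iters, rng):
    """Illustrative sketch of Procedure 3 (f must be nonnegative, not identically 0).

    local_step(z, k, rng) draws one point from the local distribution Q_k(z, .).
    """
    draw = sample_p1                                      # Step 1: P_1
    for k in range(1, n_iters + 1):
        while True:
            points = [draw(rng) for _ in range(n_samples)]   # Step 2
            y = [f(x) for x in points]                       # Step 3
            if sum(y) > 0:
                break                                        # else resample P_k

        # Step 4: P_{k+1} by superposition. First pick a parent with the
        # proportional-selection probabilities, then sample around it.
        def draw(rng, points=points, y=y, k=k):
            z = rng.choices(points, weights=y, k=1)[0]
            return local_step(z, k, rng)
    return points, y                                      # last generation


def shrinking_gaussian(z, k, rng, lo=-5.0, hi=5.0):
    """Assumed local step: Gaussian of breadth 1/k, restricted to [lo, hi]."""
    while True:
        x = z + rng.gauss(0.0, 1.0 / k)
        if lo <= x <= hi:
            return x


if __name__ == "__main__":
    rng = random.Random(0)
    pts, vals = method_of_generations(lambda x: math.exp(-(x - 2.0) ** 2),
                                      lambda r: r.uniform(-5.0, 5.0),
                                      shrinking_gaussian, 30, 20, rng)
    print(max(zip(vals, pts)))
```

The local breadth of `1/k` is only one way of making the sampling distribution narrow as the search proceeds; the schedule is an assumption of this sketch, not a prescription from the source.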

In Procedure 3, auxiliary, nonnegative functions, $f_k$, are used to construct $P_{k+1}$. These
functions should reflect the properties of $f$. For example, $f_k$ should, on the average, be
greater where $f$ is great and smaller where $f$ is small. The choice of $f_k$ can greatly affect
the quality of the resulting algorithm. Zhigljavsky suggests that the construction of these
functions should be done with a technique for extracting and using information about the
objective function during the search or be based upon some technique of objective function
estimation (Zhigljavsky, 1991, pg. 189).

Procedure 3 may be terminated when a prescribed number of iterations has been executed
or according to some other criterion. Zhigljavsky suggests termination when the
desired accuracy has been obtained. This may be determined using the methods for esti- 
mating M described in (Zhigljavsky, 1991, Ch. 4). 

There are also sequential variants of Procedure 3 (Zhigljavsky, 1991, §5.4). The distinguishing
characteristics of these algorithms are that the sampling distributions $P_{k+1}(dx)$ may
be constructed using points from all previous iterations, and, except for the first iteration,
only one sample is obtained per iteration.

3.3.2 Convergence Properties

In this subsection, the convergence properties of the global random search methods described
by Procedure 3 will be considered. To prove that the sampling distributions of methods of
generations weakly converge to the probability measure concentrated at a global optimum,
Zhigljavsky places key requirements upon the local sampling components, $Q_k$, and the global
sampling components, $R_k$. Of these requirements, two are placed on the local sampling
components:

1. The breadth of the distributions $Q_k$ must be reduced as the algorithm proceeds such
that the sequence weakly converges to a probability measure concentrated at the point
where it is located.

2. The distributions $Q_k$ must somehow be constrained so that their expansive nature
cannot overcome the convergence caused by the global sampling components, $R_k$. A
fortiori, these distributions must be designed to prevent diffusion away from global
optima in the absence of selective convergence; otherwise, additional assumptions about
the objective function, $f$, would be required.

Without the first requirement it would not be possible to prove convergence of the sampling
distributions to a probability measure concentrated at a global optimum or any other point.
Zhigljavsky satisfies the second requirement in two ways. In Corollary 3 below, a form of
local elitism is used to prevent dispersion of the sampling distribution away from global
optima. In Corollary 4 below, the search breadth of the distributions $Q_k$ is required to
be finite, and the breadth of these distributions is required to decrease rapidly enough so
that the search range becomes bounded. Finally, the distributions $R_k$ are required to be in
the form of proportional selection, (17) or (1). Heuristically, the sampling distributions of
methods of generations converge to the global sampling distributions

$$\frac{f_k(x)\,\mu(dx)}{\int f_k(z)\,\mu(dz)}$$

due to the requirements placed on the local sampling distributions $Q_k$. Furthermore, as
shown in Lemma 2, these distributions converge to global optima.


Auxiliary Statements Below, two auxiliary lemmas of considerable importance and two 
associated corollaries are presented. Appendix B presents the assumptions upon which these 
results are based. The proofs for these results are presented in (Zhigljavsky, 1991, §5.2.2). 


Lemma 1 If the assumptions (a), (b), (c), (e), (f), (g), and (s) are satisfied, then

1. the random variables with the distribution $P_M(dx_1, \ldots, dx_M)$ are symmetrically
dependent;

2. the marginal distributions $P_M(dx) = P_M(dx, X, \ldots, X)$ are representable as

$$P_M(dx) = \frac{\int R_N(dz)\, f(z)\, Q(z, dx)}{\int R_N(dz)\, f(z)} + \Delta_N(dx), \tag{19}$$

where $R_N(dz) = R_N(dz, X, \ldots, X)$; and

3. the signed measures $\Delta_N$ converge to zero in variation for $N \to \infty$ with the rate
$N^{-1/2}$, i.e., $\mathrm{var}(\Delta_N) = O(N^{-1/2})$, $N \to \infty$.


By substituting $f_k$, $N_k$, $N_{k+1}$, $P(k, N_{k-1}; \cdot)$, $P(k+1, N_k; \cdot)$,
$P(k+1, N_k; dx) = P(k+1, N_k; dx, X, \ldots, X)$, and $\Delta(k, N_k; \cdot)$ for $f$, $N$, $M$,
$R_N(\cdot)$, $P_M(\cdot)$, $P_M(dx)$, and $\Delta_N(\cdot)$, respectively, and applying Lemma 1,
Zhigljavsky obtains the following assertion.

Corollary 1 Let (a), (b), (c), and (e) be met. Then for any $k = 1, 2, \ldots$ and $N_k = 1, 2, \ldots$
the following equality holds for the unconditional distribution of random elements $x_j$:

$$P(k+1, N_k; dx) = \frac{\int P(k, N_{k-1}; dz)\, f_k(z)\, R(k, N_k, z; dx)}{\int P(k, N_{k-1}; dz)\, f_k(z)}, \tag{20}$$

where

$$R(k, N_k, z; dx) = Q_k(z, dx) + \Delta(k, N_k; dx),$$

and the signed measures $\Delta(k, N_k; \cdot)$ converge in variation to zero for $N_k \to \infty$ with the rate
of order $N_k^{-1/2}$ for any $k = 1, 2, \ldots$


This leads to the next corollary. 


Corollary 2 Let (a), (b), (c), and (e) be satisfied. Then for any $k = 1, 2, \ldots$ the sequence
of distributions $P(k+1, N_k; \cdot)$ converges in variation for $N_k \to \infty$ to the limit distributions
$P_{k+1}(\cdot)$ and

$$P_{k+1}(dx) = \frac{\int P_k(dz)\, f_k(z)\, Q_k(z, dx)}{\int P_k(dz)\, f_k(z)}. \tag{21}$$


Loosely speaking, Lemma 1 and Corollaries 1 and 2 above concern the distributions 
constructed by generational methods. The following lemma concerns the distributions con- 
structed by (17) alone. Appendix A provides a definition and three alternative characteri- 
zations of weak convergence. 

Lemma 2 Let (c), (d), (h), (i), and (j) be satisfied. Then the sequence of distributions

$$\frac{f^m(x)\,\mu(dx)}{\int f^m(z)\,\mu(dz)} \tag{22}$$

weakly converges to $\varepsilon^*(dx) = \varepsilon_{x^*}(dx)$ for $m \to \infty$.

Convergence Properties The sufficient conditions for the weak convergence of the
distribution sequences (20) and (21) to $\varepsilon^*(dx)$ for $k \to \infty$ will now be presented. The proofs
for these results are presented in (Zhigljavsky, 1991, §5.2.3).

Theorem 2 Let the conditions (c), (d), (e), (h), (i), and (j) be satisfied as well as (k) and
(m) or (l) and (n). Then the distribution sequence determined through (21) or, respectively,
through (20) weakly converges to $\varepsilon^*(dx)$ for $k \to \infty$.

With the exception of conditions (m) and (n), all of the required conditions for Theorem 2
are natural and reasonable. As mentioned previously, it is of great interest to determine
the sufficient conditions for the satisfaction of (m) and (n). In (Zhigljavsky, 1991, §5.2.3),
Zhigljavsky formulates the sufficient conditions for distribution convergence to $\varepsilon^*(dx)$ for the
two theoretically most important ways of choosing the transition probabilities $Q_k(z, dx)$, as
follows.

Corollary 3 Let the conditions (c), (d), (e), (h), (i), (j), (o), (p), (q), and (t) be satisfied.
Furthermore, let (k) be satisfied for the transition probabilities $T_k(x, dz)$ of (59). Then the
sequence of distributions determined by (21) weakly converges to $\varepsilon^*(dx)$ for $k \to \infty$.

Corollary 4 Let the conditions (e), (h), (i), (j), (q), (r), and (t) be satisfied. Then the
sequence of distributions determined by (21) weakly converges to $\varepsilon^*(dx)$ for $k \to \infty$.

Zhigljavsky asserts that, like Theorem 2, Corollaries 3 and 4 may be reformulated to
demonstrate the convergence of (20) to $\varepsilon^*(dx)$. Corollary 4, the less trivial of the two,
was then reformulated and proved.

Corollary 5 Let the conditions formulated in Corollaries 1 and 4 be satisfied. Then there
exists a sequence of natural numbers $N_k$ ($N_k \to \infty$ for $k \to \infty$) such that the sequence of
distributions $P(k+1, N_k; dx)$ determined by (20) weakly converges to $\varepsilon^*(dx)$ for $k \to \infty$.

4 Genetic Algorithms as Global Random Search Methods

Genetic algorithms are global random search methods. Accordingly, it is argued that genetic
algorithm behavior is best described by the construction and evolution of the sampling
distributions. Furthermore, it is preferred that these sampling distributions be described relative
to the phenospace, rather than the genospace. However, genotypic sampling distributions
are equally useful when the distribution of candidate solutions across the genospace is
understood or known. Matching the simplicity of the genetic algorithm itself, this perspective
and the theory associated with it are remarkably simple. Furthermore, it will be shown that
this is a suitable theory for genetic algorithm behavior according to the criteria established
in Section 2.

The genotypic sampling distributions of genetic algorithms have been described previously
in the literature. The sampling distributions arising from proportional selection and
mutation are presented in (Davis & Principe, 1991). Those resulting from proportional
selection and one-point crossover are described in (Bridges & Goldberg, 1987; Whitley, 1993).
Statistical measures derived from recombination operators and their relationship to the
objective function are presented in (Manderick, de Weger & Spiessens, 1991). The sampling
distributions constructed using proportional selection, one-point crossover, and mutation are
presented in (Vose & Liepins, 1991a). Recently, Vose independently recognized that the
interpretation of the population transition operators as sampling distributions is a unifying theme
that nicely connects his finite and infinite population models of genetic algorithms (Vose,
1993b).

This section applies the formalism and insights of the theory of global random search 
methods in Section 3 to genetic algorithms. First, the genetic algorithm is reformulated and 
generalized in terms of phenotypic search. Genetic algorithm behavior is then described in 
terms of three heuristics related to the procedural elements of genetic algorithms. Finally, 
the suitability of sampling distribution theory for describing genetic algorithm behavior is 
considered in the context of the criteria established in Section 2. 

4.1 Reformulating the Genetic Algorithm 

The canonical genetic algorithm searches the discrete space of attainable strings $\mathcal{A}$, where
a single string is denoted $A$ or $A_i$. In Procedure 4, the canonical genetic algorithm is
expressed in the form of the methods of generations in Subsection 3.3.1. It is assumed that
if the objective function, $f : \mathcal{A} \to \Re^1$, is evaluated with noise at iteration $k$, then the result
is a nonnegative random variable $y_k(A) = f(A) + \xi_k(A)$, where $\xi_k(A)$ is a random variable.

Procedure 4 The canonical genetic algorithm as a generational global random search method.

1. Choose a distribution $P_1$ on $\mathcal{A}$ and set $k = 1$.

2. Sample $N_k$ times $P_k$ to obtain the strings $A_1^{(k)}, \ldots, A_{N_k}^{(k)}$.

3. Evaluate the random variables $y_k(A_j^{(k)})$ at the strings $A_j^{(k)}$, where $y_k(A) = f_k(A) +
\xi_k(A) \ge 0$ with probability one, $f_k$ is an auxiliary nonnegative function constructed
using the observed values of $f$ at the strings $A_j^{(i)}$ for $j = 1, \ldots, N_i$, $i = 1, \ldots, k$, and
$f : \mathcal{A} \to \Re^1$ is the fitness or objective function. If

$$\sum_{j=1}^{N_k} y_k(A_j^{(k)}) = 0,$$

repeat the sampling by returning to Step 2.

4. Construct the next distribution according to

$$P_{k+1}(A_i) = \sum_{j'=1}^{N_k} \sum_{j''=1}^{N_k} p_{j'}^{(k)}\, p_{j''}^{(k)}\, Q_k(A_{j'}^{(k)}, A_{j''}^{(k)}, A_i), \tag{23}$$

where

$$p_j^{(k)} = \frac{y_k(A_j^{(k)})}{\sum_{i=1}^{N_k} y_k(A_i^{(k)})}. \tag{24}$$

5. If the stopping criterion is satisfied, then stop; otherwise, substitute $k + 1$ for $k$ and go
to Step 2.

The construction of the sampling distributions $\{P_{k+1}\}$ in (23) is consistent with Lemma
1 in (Vose & Liepins, 1991a) and it proceeds in two stages: a global phase and a local phase.
In the global phase, the realizations $A_{j'}^{(k)}$ and $A_{j''}^{(k)}$ are obtained using global
information about $f$ contained in the population and (24). The local phase corresponds to
recombination, which encompasses both crossover and mutation, and is performed with the
transition probability $Q_k(A_{j'}^{(k)}, A_{j''}^{(k)}, \cdot)$.
The use of two samples for the construction of the transition probability
distribution is what distinguishes genetic algorithms from other global random
search methods, including evolutionary programming (Fogel & Atmar, 1990; Back &
Schwefel, 1993) and evolutionary strategies (Back & Schwefel, 1993; Back, Hoffmeister &
Schwefel, 1991). It is on the basis of these two samples and a similarity measure that the
locality of $Q_k(A_{j'}^{(k)}, A_{j''}^{(k)}, \cdot)$ is typically determined. This is discussed further in Subsection 4.2.
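The distribution $P_{k+1}$ in (23) is sampled using superposition: first the discrete
distribution that assigns the probabilities $p_j^{(k)}$ of (24) to the strings $A_j^{(k)}$,

$$\begin{pmatrix} A_1^{(k)} & \cdots & A_{N_k}^{(k)} \\ p_1^{(k)} & \cdots & p_{N_k}^{(k)} \end{pmatrix}, \tag{25}$$

is sampled twice, then the distribution $Q_k(A_{j'}^{(k)}, A_{j''}^{(k)}, \cdot)$ is sampled for each pair of
realizations $A_{j'}^{(k)}$ and $A_{j''}^{(k)}$. The transition probability $Q_k(A_{j'}^{(k)}, A_{j''}^{(k)}, A)$ describes the probability
of obtaining the realization $A$ given the pair $A_{j'}^{(k)}$ and $A_{j''}^{(k)}$. The distribution $P_{k+1}$ in (23)
may also be sampled using a variance reduction technique (for examples, see (Baker, 1987;
Baker, 1989; Zhigljavsky, 1991)). Finally, the distributions $\{P_{k+1}\}$ in (23) may alternatively
be constructed to generate a pair of samples (Peck, 1993),

$$P_{k+1}(A_{i'}, A_{i''}) = \sum_{j'=1}^{N_k} \sum_{j''=1}^{N_k} p_{j'}^{(k)}\, p_{j''}^{(k)}\, Q_k(A_{j'}^{(k)}, A_{j''}^{(k)}, A_{i'}, A_{i''}), \tag{26}$$

where the transition probability $Q_k(A_{j'}^{(k)}, A_{j''}^{(k)}, A_{i'}, A_{i''})$ describes the probability of realizing
the pair $(A_{i'}, A_{i''})$ given the pair $(A_{j'}^{(k)}, A_{j''}^{(k)})$.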

The auxiliary functions $f_k$ in Step 3 should reflect the properties of $f$. That is, they
should be greater when $f$ is greater and smaller when $f$ is smaller. Common choices of
$f_k$ include functions for fitness scaling and ranking. These functions may, in general, be
constructed using any subset of the previous samples. Generational genetic algorithms,
however, typically only use $A_1^{(k-1)}, \ldots, A_{N_{k-1}}^{(k-1)}$.

The genetic algorithm may also be described in terms of the phenospace or feasible space
$X$. In genetic algorithms, each string or element $A$ of $\mathcal{A}$ is an encoding of a candidate
solution $x$, which is an element of the feasible space $X$. Due to the mapping $\mathcal{M} : \mathcal{A} \to X$,
the sampling distribution $Q_k(A_{j'}, A_{j''}, \cdot)$ on $\mathcal{A}$ constructed by selection and recombination
also imposes a sampling distribution $Q_k(z', z'', \cdot)$ on $X$. In other words, the realization $x$
obtained from $Q_k(\mathcal{M}(A_{j'}), \mathcal{M}(A_{j''}), \cdot)$ is identical to $\mathcal{M}(A_i)$, where $A_i$ is the realization
obtained from $Q_k(A_{j'}, A_{j''}, \cdot)$. The genetic algorithm can then be generalized to search the
phenospace, where the sampling distributions $\{P_{k+1}\}$ are constructed with respect to $X$
according to

$$P_{k+1}(dx) = \int_X \int_X R_k(dz')\, R_k(dz'')\, Q_k(z', z'', dx), \tag{27}$$

where $R_k$ is a probability measure on $X$ and $Q_k(z', z'', \cdot)$ is a transition probability such that
it is a measurable function with respect to the first two arguments and a probability measure
with respect to the third. The distributions $\{P_{k+1}\}$ are typically sampled using superposition:
first realizations $z'$ and $z''$ are obtained by sampling $R_k$, then $Q_k(z', z'', \cdot)$ is sampled to obtain
$x$. Finally, the distributions $\{P_{k+1}\}$ in (27) may alternatively be constructed to generate a
pair of samples (Peck, 1993).

In analogy to (26), the distributions $\{P_{k+1}\}$ may alternatively be constructed according
to

$$P_{k+1}(dx', dx'') = \int_X \int_X R_k(dz')\, R_k(dz'')\, Q_k(z', z'', dx', dx''), \tag{28}$$

where, once again, $R_k$ is a probability measure on $X$ and $Q_k(z', z'', dx', dx'')$ is a transition
probability such that it is a measurable function with respect to the first two arguments and
a probability measure with respect to the last two arguments. For the purposes of analysis
and discussion, only (27) will be considered further.

To generate distributions consistent with (27), the genetic algorithm may be generalized
in the following form, where $\mathcal{B}$ is the $\sigma$-algebra of the Borel subsets of $X$:

Procedure 5 The generalized genetic algorithm as a generational global random search
method.

1. Choose a distribution $P_1$ on $(X, \mathcal{B})$ and set $k = 1$.

2. Sample $N_k$ times $P_k$ to obtain the points $x_1^{(k)}, \ldots, x_{N_k}^{(k)}$.

3. Evaluate the random variables $y_k(x_j^{(k)})$ at the points $x_j^{(k)}$, where $y_k(x_j^{(k)}) = f_k(x_j^{(k)}) +
\xi_k(x_j^{(k)}) \ge 0$ with probability one, $f_k$ is an auxiliary nonnegative function constructed
using the observed values of $f$ at the points $x_j^{(i)}$ for $j = 1, \ldots, N_i$, $i = 1, \ldots, k$, and
$f : X \to \Re^1$ is the fitness or objective function. If

$$\sum_{j=1}^{N_k} y_k(x_j^{(k)}) = 0,$$

repeat the sampling by returning to Step 2.

4. Construct the next distribution according to

$$P_{k+1}(dx) = \sum_{j'=1}^{N_k} \sum_{j''=1}^{N_k} p_{j'}^{(k)}\, p_{j''}^{(k)}\, Q_k(x_{j'}^{(k)}, x_{j''}^{(k)}, dx), \tag{29}$$

where

$$p_j^{(k)} = \frac{y_k(x_j^{(k)})}{\sum_{i=1}^{N_k} y_k(x_i^{(k)})}. \tag{30}$$

5. If the stopping criterion is satisfied, then stop; otherwise, substitute $k + 1$ for $k$ and go
to Step 2.
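
Procedure 5 can likewise be sketched directly on a real-valued phenospace, with no string encoding. The following is an illustration under stated assumptions, not the paper's implementation: evaluations are noise-free with $f_k = f$, the objective is strictly positive (so the check in Step 3 never triggers), and $Q_k(z', z'', \cdot)$ is taken to be uniform on the axis-aligned box spanned by the two parents, padded by a fraction `alpha` on each side. This recombination operator and all names are hypothetical choices made for the example.

```python
import random
import statistics


def blend(z1, z2, alpha, rng):
    """Sample Q_k(z1, z2, .): uniform on the parents' bounding box, padded by alpha."""
    child = []
    for a, b in zip(z1, z2):
        lo, hi = min(a, b), max(a, b)
        pad = alpha * (hi - lo)
        child.append(rng.uniform(lo - pad, hi + pad))
    return child


def spread(points):
    """Average per-coordinate population standard deviation of the sample."""
    dims = len(points[0])
    return sum(statistics.pstdev([p[d] for p in points]) for d in range(dims)) / dims


def generalized_ga(f, bounds, pop_size, generations, alpha=0.25, seed=0):
    """Sketch of Procedure 5 on a box in R^d; `f` must be strictly positive."""
    rng = random.Random(seed)
    # Steps 1 and 2: P_1 is assumed uniform on the box given by `bounds`.
    points = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    spreads, mean_fitness = [], []
    for _ in range(generations):  # Step 5: fixed number of generations.
        y = [f(x) for x in points]  # Step 3 (noise-free evaluation).
        if sum(y) <= 0:
            raise ValueError("this sketch assumes a strictly positive objective")
        spreads.append(spread(points))
        mean_fitness.append(sum(y) / len(y))
        # Step 4: sample P_{k+1} by superposition, as in (29) and (30).
        # Children may fall slightly outside `bounds` because of the padding.
        new_points = []
        for _ in range(pop_size):
            z1, z2 = rng.choices(points, weights=y, k=2)
            new_points.append(blend(z1, z2, alpha, rng))
        points = new_points
    return points, spreads, mean_fitness
```

Because the breadth of this $Q_k$ scales with the distance between the two selected parents, the recorded `spreads` shrink as selection concentrates the samples, without any externally imposed schedule.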


4.2 Genetic Algorithm Behavior 

The construction and evolution of the distributions $\{P_{k+1}\}$ provide considerable insights into
the interplay of the procedural elements. This level of abstraction lies between those of the
procedure and the populational transition operators of Markov chain analysis. Furthermore, 
it is useful for understanding how genetic algorithms search the feasible space and how they 
generate increasingly better candidate solutions. It is also suitable for rigorous mathematical 
analysis and derivation of convergence properties. 

Genetic algorithms can be described on the basis of the three following heuristics, which 
are related to the procedural elements of genetic algorithms: 

i. the number of times a previous sample is chosen for constructing a transition
probability, $Q_k$, is dependent on the function evaluation observed at that point,

ii. the similarities between previous samples should be exploited in the construction
of the transition probabilities, and

iii. often enough, the objective or fitness function behaves similarly on similar
samples.

The description of genetic algorithm behavior begins with a randomly generated set of sam- 
ples from the search space (the initial population). For each sample, the objective function 
value is evaluated. Then pairs of high performance samples are competitively selected from 
the set of samples. For each pair of samples, another one or two new samples are randomly 
generated that are similar to the high performance samples. Since it is assumed that the 
objective function behaves similarly on similar samples, the new samples are also likely to 
be of high performance. The search process continues with the evaluation of the objective 
function at the new samples. Since the new samples also compete against each other in the 
selection process, the set of samples becomes increasingly concentrated in the high perfor- 
mance regions of the search space. As the samples become increasingly concentrated, they 
become more similar and the breadth of search dynamically decreases. Therefore, unlike
most other global random search methods, genetic algorithms do not require predetermined
schedules for controlling the construction of their sampling distributions.

The word similar is critical in the above description. However, there is no similarity 
criterion that applies to all problem domains and search spaces. While not yet properly 
investigated for this purpose, the fitness correlation coefficient of an operator may serve as 
a useful measure of similarity (Manderick, de Weger & Spiessens, 1991). The similarities 
exploited by an algorithm may be either genotypic or phenotypic, depending on the na- 
ture of the implementation. In the canonical genetic algorithm, it is the similarities in the 
candidate solution encodings that are exploited. Each of the traditional crossover operators
(i.e., one-point, multi-point, uniform, and parameterized uniform crossover) preserves
the portions or bits of the encodings common to both parents in the children. Searching
is performed by exchanging or randomizing the remaining bits in some manner. Since the
likelihood of altering bits of the candidate solution encoding through the process of mutation
typically decreases exponentially with the number of altered bits, mutation also results in
encodings that are similar to the original encoding. Interestingly, it is in this manner that
the string similarities common to high performance samples pervade later populations. A
more extensive explanation for observations of schema growth that does not appeal to the
schema theory is presented in (Peck, 1993, §5.4).
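
The preservation of common bits and the exponential decay of mutation likelihood can be checked with a short sketch. The operators below are textbook forms of one-point and uniform crossover written for illustration; the function names are hypothetical, and the mutation model assumes independent bit flips at a fixed rate.

```python
import random


def one_point(a, b, rng):
    """One-point crossover: head of `a`, tail of `b`."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def uniform(a, b, rng):
    """Uniform crossover: each bit is copied from either parent with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def preserves_common_bits(operator, a, b, rng, trials=1000):
    """Check empirically that children agree with both parents wherever the parents agree."""
    shared = [i for i, (x, y) in enumerate(zip(a, b)) if x == y]
    for _ in range(trials):
        child = operator(a, b, rng)
        if any(child[i] != a[i] for i in shared):
            return False
    return True


def prob_specific_flips(m, n_bits, rate):
    """Probability that mutation flips one particular set of m bits and no others."""
    return rate ** m * (1.0 - rate) ** (n_bits - m)
```

Each additional altered bit multiplies `prob_specific_flips` by `rate / (1 - rate)`, which is the exponential decrease referred to above.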

Having considered how the second heuristic is satisfied, we now consider
the other two heuristics. In genetic algorithms, the first heuristic is satisfied by the global
sampling phase, which is described by (24). The third heuristic is problem dependent. As 
addressed in Subsection 5.1, it is also dependent on the candidate solution representation. 
Furthermore, it has been pointed out that the genetic algorithm will degenerate into a 
random search if this heuristic is not satisfied (Rawlins, 1991). 

4.3 The Sufficiency of the Theory 

The mathematical description of the theory presented in this section is an exact represen- 
tation of genetic algorithms based on the procedural elements. Thus, any phenomena of 
genetic algorithms will be explainable in its context. The explanatory and predictive capa- 
bilities of the theory are drawn upon throughout the remainder of this paper. The theory 
is also robust with respect to algorithmic variations. Procedure 5, for example, allows for 
fitness scaling, ranking, non-traditional recombination operators, independence of the encod- 
ing method, and arbitrary search spaces. Consequently, this theory is sufficient according to 
the criteria established at the beginning of Section 2. 

Since both this theory and the theory presented in (Vose & Liepins, 1991a; Nix & Vose, 
1992; Vose, 1993a) are exact, they are isomorphic. Since they have different theoretical bases 
and levels of abstraction, however, these two analytical perspectives should be complemen- 
tary. These theories are distinguished from each other in two ways. The first is a change of 
emphasis or interpretation. In (Vose & Liepins, 1991a; Nix & Vose, 1992; Vose, 1993a), the
interpretation of the mathematics is lumped into a transition between populations. In the
present theory, the emphasis is on how the components of the sampling distribution affect
the search. The second distinguishing characteristic is the consideration of the phenotypic
sampling distribution, if possible.

5 Factors Affecting the Sampling Distributions 

Based on the conclusions of Section 4.2, understanding the factors affecting the sampling
distributions $\{P_{k+1}\}$ is particularly important for understanding, applying, and designing
genetic algorithms. In pursuit of this understanding, this section addresses the issues
associated with the encoding of candidate solutions, the construction of the sampling distributions
$R_k$ (i.e., selection), the construction of the distributions $Q_k$ (i.e., recombination), and
population management.

5.1 Candidate Solution Encoding 

Genetic algorithms work by exploiting similarities between previous samples and they de- 
pend on the objective function behaving similarly on similar samples. A crucial design issue, 
therefore, is the choice of similarities to exploit. Ideally, these similarities should be chosen 
with respect to the nature of the candidate solutions and the problem under consideration. 

Typically, genetic algorithms encode candidate solutions and then exploit the similarities
in the encodings. As a consequence, the choice of candidate solution encoding has a
tremendous impact on the performance of genetic algorithms. According to the choice of
encoding, a problem may be reduced to the archetypically easy “counting 1's” problem (Vose
& Liepins, 1991b), or genetic search may be rendered no more effective than a pure random
search (Rawlins, 1991).

For greatest benefit, the encoding method should be matched to the candidate solutions
and the problem under consideration such that similar strings will result in similar candidate
solutions. Unfortunately, it is not generally possible to preserve similarities in both $\mathcal{A}$ and
$X$. Typically, genetic algorithm practitioners simply rely upon the fortuitous existence of
exploitable similarities. Since the use of binary encodings increases the number of
opportunities for exploitable similarities, it is not surprising that such encodings are the most
commonly used.

Figure 1: The relationship between the similarities of encodings and the similarities of the
numbers they represent: left) when natural code is used, right) when gray code is used.
(Axis label in both panels: Hamming Distance.)

To illustrate the problems of choosing an encoding, the specific problem of encoding an 
integer is considered in (Peck, 1993). Both natural code and the gray code used in Genesis 
Version 5.0 (Grefenstette, 1990) are analyzed. Ten bit encodings were used to represent 
integers in the range [0,1023]. In this analysis, the Hamming distance and the absolute 
difference are used as similarity measures for the encodings and integer values, respectively. 
As shown in Figure 1, two similar encodings will not necessarily result in similar integers 
for either encoding method. In fact, no integer encoding longer than two bits can satisfy
this objective. This is because an integer is adjacent to only two other integers, yet an
integer encoded with $\ell$ bits is a Hamming distance of one from exactly $\ell$ other encodings.
Figure 1 also suggests why genetic algorithms using these encodings are usually effective. 
The region between the 25th and 75th percentiles in each case shows that, in most instances, 
increasingly similar encodings result in increasingly similar integers. 
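
The mismatch between Hamming distance and absolute difference can be reproduced with a short sketch. It assumes the standard binary-reflected Gray code; the Gray code used in Genesis Version 5.0 may differ in detail, so this is an illustration of the argument rather than a replication of the analysis in (Peck, 1993). All names are hypothetical.

```python
BITS = 10  # ten-bit encodings of the integers in [0, 1023]


def gray_encode(n):
    """Binary-reflected Gray code of the integer n."""
    return n ^ (n >> 1)


def gray_decode(g):
    """Inverse of gray_encode."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def natural_decode(code):
    """Natural binary code: the code word is the integer itself."""
    return code


def neighbour_gaps(decode, bits=BITS):
    """Absolute integer differences over all pairs of code words at Hamming distance one."""
    gaps = []
    for code in range(1 << bits):
        for b in range(bits):
            other = code ^ (1 << b)  # flip exactly one bit
            gaps.append(abs(decode(code) - decode(other)))
    return gaps


if __name__ == "__main__":
    for name, decode in (("natural", natural_decode), ("gray", gray_decode)):
        gaps = neighbour_gaps(decode)
        print(name, "adjacent-integer neighbours:", gaps.count(1), "largest gap:", max(gaps))
```

Every code word has ten Hamming neighbours, but at most two of them can decode to adjacent integers, so most single-bit changes produce a gap larger than one under either encoding.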

The above discussion illustrates that it is very difficult to design an appropriate candidate 
solution encoding scheme, even when the candidate solution is as simple as an integer. It is
also very difficult to envision the distribution of candidate solutions across A. This difficulty,
combined with trying to understand how A is being sampled by selection and recombination,
makes it very difficult to understand genetic algorithm behavior in either the genospace or
the domain of the problem being considered.

The many problems associated with encoding the candidate solutions and designing the 
sampling distributions to exploit string encoding similarities may very easily be eliminated by 
simply designing the sampling distributions to exploit similarities in the candidate solutions 
themselves. There is no theoretical requirement for the use of string encodings and there are 
many advantages to their elimination: 

1. The problem specific structure of X is typically much better understood than the 
distribution of candidate solutions across A. 

2. The recombination operators, $Q_k$, may be customized to exploit knowledge of the
structure and similarities of the candidate solutions that are pertinent to the problem 
under consideration. 

3. The behavior of the genetic algorithm will be better understood since the relationship 
of the sampling distributions to the structure of X will be better understood. 

4. Only the recombination operators are problem dependent, the remainder of the algo- 
rithm (Procedure 5) is unchanged. 

5. Mathematical analysis is easier due to the elimination of the mapping $\mathcal{M}$.

Finally, it should be noted that designing genetic algorithms to search the phenospace, $X$, as
opposed to the genospace, $\mathcal{A}$, is already a common practice (e.g., consider order dependent
problems).

Radcliffe has also considered many of these ideas (Radcliffe, 1991b; Radcliffe, 1991a; 
Radcliffe, 1993). Referring to subsets of the search space as equivalence classes or formae, 
Radcliffe argues:

The critical tasks are thus finding formae which characterise solutions in meaningful
ways and developing operators which usefully manipulate these formae (Radcliffe,
1991b).

These formae are generalizations of schemata that are not necessarily defined with respect 
to string similarities. By considering recombination operators that characterize solutions 
in meaningful ways and do not necessarily exploit string similarities, the need for string 
encodings is effectively eliminated. 

5.2 The Rk Class of Distributions: Selection 

The distributions $R_k$ in (27) make use of global information obtained about the objective
function $f$. Furthermore, these distributions are largely responsible for concentrating search
in high performance regions of the search space. Since the realizations obtained by sampling
the distributions $R_k$ are previously obtained samples of $X$, these distributions do not generate
new candidate solutions or expand the search domain.

To a great degree, the way of constructing the distributions $R_k$ establishes the general
structure and originality of a global random search method (Zhigljavsky, 1991). In the
canonical genetic algorithm, proportional selection is used, as in (1). In practice, auxiliary
functions $f_k$ related to the objective function $f$ are typically constructed for the purposes
of fitness scaling or ranking. The distributions $R_k$ are then implemented according to (30).
Many other methods may be used instead of proportional selection (Goldberg & Deb, 1991;
Back & Hoffmeister, 1991; de la Maza & Tidor, 1993), including the methods used in
evolution strategies (Back & Schwefel, 1993; Back, Hoffmeister & Schwefel, 1991) and evolutionary
programming (Fogel & Atmar, 1990; Back & Schwefel, 1993).

Proportional selection is very simple, is suitable for use in the presence of noise, and
has nice theoretical properties. Theorem 3 indicates that the best string in the initial
population eventually dominates the population (Peck, 1993; Peck & Dhawan, 1993). This
theorem simulates the effects of an arbitrarily large population by allowing fractional numbers
of individuals. Comparing (31) to (22) provides additional insights into genetic algorithm
behavior. These equations are consistent with Equations (7) and (8) of (Goldberg & Deb,
1991).


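Theorem 3 The observed average population fitness, $f(A(t))$, at time $t$, and the number of
instances of a particular string $A_i$ at time $t$, $m(A_i, t)$, resulting from the use of proportional
selection may be expressed:

$$m(A_i, t) = \frac{N\, m(A_i, 0)\, f^t(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^t(A_j)} \tag{31}$$

and

$$f(A(t)) = \frac{\sum_{A_i \in \mathcal{A}} m(A_i, 0)\, f^{t+1}(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^t(A_j)}, \tag{32}$$

where $N$ denotes the size of the population, $f^t(A_i)$ denotes $f(A_i)$ raised to the power $t$,
and $m(A_j, 0) = 0$ if $A_j \notin A(0)$.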


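Proof: The following inductive proof begins with the initial steps. By definition,

\begin{align}
m(A_i, 0) &= \frac{N\, m(A_i, 0)\, f^0(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^0(A_j)} \tag{33}
\end{align}

and

\begin{align}
f(A(0)) &= \frac{1}{N} \sum_{A_i \in \mathcal{A}} m(A_i, 0)\, f(A_i) \tag{34} \\
&= \frac{\sum_{A_i \in \mathcal{A}} m(A_i, 0)\, f^1(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^0(A_j)}, \tag{35}
\end{align}

since $\forall t \ge 0$, $N = \sum_{A_i \in \mathcal{A}} m(A_i, t)$. Furthermore,

\begin{align}
m(A_i, 1) &= \frac{m(A_i, 0)\, f(A_i)}{f(A(0))} \tag{36} \\
&= \frac{m(A_i, 0)\, f^1(A_i)}{\frac{1}{N} \sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^1(A_j)} \tag{37} \\
&= \frac{N\, m(A_i, 0)\, f^1(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^1(A_j)} \tag{38}
\end{align}

and

\begin{align}
f(A(1)) &= \frac{1}{N} \sum_{A_i \in \mathcal{A}} m(A_i, 1)\, f(A_i) \notag \\
&= \frac{1}{N} \frac{\sum_{A_i \in \mathcal{A}} N\, m(A_i, 0)\, f^1(A_i)\, f(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^1(A_j)} \notag \\
&= \frac{\sum_{A_i \in \mathcal{A}} m(A_i, 0)\, f^2(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^1(A_j)}. \notag
\end{align}

Let us now assume that

\begin{align}
m(A_i, k) &= \frac{N\, m(A_i, 0)\, f^k(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^k(A_j)}, \notag \\
f(A(k)) &= \frac{\sum_{A_i \in \mathcal{A}} m(A_i, 0)\, f^{k+1}(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^k(A_j)}. \notag
\end{align}

Then

\begin{align}
m(A_i, k+1) &= \frac{m(A_i, k)\, f(A_i)}{f(A(k))} \notag \\
&= \frac{N\, m(A_i, 0)\, f^k(A_i)\, f(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^k(A_j)} \cdot \frac{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^k(A_j)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^{k+1}(A_j)} \notag \\
&= \frac{N\, m(A_i, 0)\, f^{k+1}(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^{k+1}(A_j)} \notag
\end{align}

and

\begin{align}
f(A(k+1)) &= \frac{1}{N} \sum_{A_i \in \mathcal{A}} m(A_i, k+1)\, f(A_i) \notag \\
&= \frac{1}{N} \frac{\sum_{A_i \in \mathcal{A}} N\, m(A_i, 0)\, f^{k+1}(A_i)\, f(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^{k+1}(A_j)} \tag{49} \\
&= \frac{\sum_{A_i \in \mathcal{A}} m(A_i, 0)\, f^{k+2}(A_i)}{\sum_{A_j \in \mathcal{A}} m(A_j, 0)\, f^{k+1}(A_j)}. \tag{50}
\end{align}

Since it has been shown that the theorem is satisfied for $t = 0, 1$ and that if the theorem
is satisfied at $t = k$ then it is also satisfied at $t = k + 1$, the process of induction completes
the proof. ■

Figure 2: Ideal string and population fitness growth curves, based on a clumped initial
population: left) The growth of instances of the strings having the indicated fitness, right)
The growth of the observed average population fitness. (Axis label in both panels: Generations.)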

In (Syswerda, 1991), the effects of proportional selection on the growth of strings are
investigated. Three cases are considered: the ideal (infinite population) case, the finite
population case using the standard ‘roulette wheel’ proportional selection method, and the
finite population case using a selection variance reduction technique, the Stochastic Universal
Sampling (SUS) selection method (Baker, 1987). In all three cases, the population fitnesses
are initially clumped at specific values: 10% of the population has a fitness of 10, 10% has a
fitness of 20, and so on, up to a fitness of 100. A number of interesting observations can be
made from the presented results. In the ideal case, the growth curves, which were obtained
using difference equations, are indistinguishable from those obtained using the equations of
Theorem 3. The growth curves derived from Theorem 3 are presented in Figure 2. When
a finite population and standard selection are used, the growth curves are nearly ideal, but
noticeably different. When the variance reduction technique is employed, the growth curves
are indistinguishable from the ideal curves.
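
The agreement between the difference equations and the closed forms of Theorem 3 in the ideal case can be checked numerically. The sketch below is an illustration only: it assumes the clumped initial population described above with a population size of 100 (ten individuals at each fitness level), and the function names are hypothetical.

```python
def iterate_counts(m0, fitness, t):
    """Apply m(A_i, s+1) = m(A_i, s) f(A_i) / mean fitness at s, for s = 0, ..., t-1."""
    counts = [float(m) for m in m0]
    n = sum(counts)
    for _ in range(t):
        mean = sum(c * f for c, f in zip(counts, fitness)) / n
        counts = [c * f / mean for c, f in zip(counts, fitness)]
    return counts


def closed_form_counts(m0, fitness, t):
    """Closed form: N m(A_i, 0) f^t(A_i) / sum_j m(A_j, 0) f^t(A_j)."""
    n = sum(m0)
    weights = [m * f ** t for m, f in zip(m0, fitness)]
    total = sum(weights)
    return [n * w / total for w in weights]


def closed_form_mean(m0, fitness, t):
    """Closed form: sum_i m(A_i, 0) f^(t+1)(A_i) / sum_j m(A_j, 0) f^t(A_j)."""
    numerator = sum(m * f ** (t + 1) for m, f in zip(m0, fitness))
    denominator = sum(m * f ** t for m, f in zip(m0, fitness))
    return numerator / denominator


if __name__ == "__main__":
    fitness = [10 * j for j in range(1, 11)]  # fitness levels 10, 20, ..., 100
    m0 = [10] * 10                            # 10% of a population of 100 at each level
    for t in (0, 5, 10, 30):
        print(t, closed_form_counts(m0, fitness, t)[-1], closed_form_mean(m0, fitness, t))
```

Fractional counts are allowed throughout, matching the theorem's treatment of an arbitrarily large population.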

In (Peck, 1993), an empirical study is performed to determine whether the discrepancy
between the ideal growth curves and the growth curves using the finite population and
standard selection is significant. Populations of 10, 20, 40, and 100 strings were investigated.
Uncertainty in the results was reduced by averaging the curves from 1000 independent
experiments. Both standard and SUS proportional selection methods were investigated to
determine the effects of selection noise. Figures 3 and 4 present a portion of the results.

Figure 3: The average proportion of individuals of different fitnesses, using standard
proportional selection, in clumped population distributions of left) 10 individuals, and right) 100
individuals.

Figure 4: The average proportion of individuals of different fitnesses, using SUS
proportional selection, in clumped population distributions of left) 10 individuals, and right) 100
individuals.

The empirical results indicate that poorer performance should be expected when smaller 
populations are used, regardless of the selection method. Analytical proofs or explanations of 
this observation are presently unavailable. Using standard proportional selection, extinction 
of the best individuals was observed for populations of 10, 20, 40, and 100 individuals
in 40%, 20%, 3%, and 0% of the trials, respectively. Extinction of the best individuals
is not possible using SUS proportional selection. Extinction, therefore, can explain some 
of the poorer performance, but not all of it. The poorer performance does seem to be 
well correlated with the sampling variance, however. There is higher sampling variance for 
the smaller populations and the performance is worse for smaller populations, regardless 
of the selection method. Furthermore, the use of the variance reduction technique results 
in improved performance. Unfortunately, the relationship, if any, between high sampling 
variance and poorer selection performance is presently not understood. 
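
The contrast between the two selection methods can be sketched in a few lines. The following is a minimal illustration, not the experimental setup of (Peck, 1993): the fitness layout (one individual of fitness 2 among individuals of fitness 1), the single generation of selection, and the function names are all assumptions made for the example. It estimates how often the unique best individual receives no copies under standard proportional selection and under SUS.

```python
import random

def roulette_select(fitness, n, rng):
    """Standard proportional selection: n independent spins of a biased wheel."""
    return rng.choices(range(len(fitness)), weights=fitness, k=n)

def sus_select(fitness, n, rng):
    """Stochastic universal sampling: one spin, n equally spaced pointers."""
    step = sum(fitness) / n
    start = rng.random() * step
    picks, cumulative, i = [], fitness[0], 0
    for j in range(n):
        pointer = start + j * step
        # Advance to the individual whose slice of the wheel holds the pointer.
        while cumulative <= pointer and i < len(fitness) - 1:
            i += 1
            cumulative += fitness[i]
        picks.append(i)
    return picks

def extinction_rate(select, pop_size, trials, seed=0):
    """Fraction of trials in which the best individual (index 0) gets no copies.

    ASSUMPTION: a toy population with one individual of fitness 2 and the
    rest of fitness 1; only one generation of selection is simulated.
    """
    rng = random.Random(seed)
    fitness = [2.0] + [1.0] * (pop_size - 1)
    lost = sum(0 not in select(fitness, pop_size, rng) for _ in range(trials))
    return lost / trials

if __name__ == "__main__":
    for size in (10, 20, 40, 100):
        print(size,
              extinction_rate(roulette_select, size, 1000),
              extinction_rate(sus_select, size, 1000))
```

Because the best individual's expected number of copies is at least one in this toy population, SUS always places a pointer on it, whereas independent spins can miss it entirely.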

5.3 The Qk Class of Distributions: Recombination 

The distributions $Q_k$ in (27) typically perform a localized search according to some similarity 
measure, and are referred to as recombination operators in the genetic algorithm literature. 
The distributions $Q_k(z', z'', \cdot)$ are dependent on two realizations, $z'$ and $z''$, which are likely 
to be of high performance since they are obtained through selection. These distributions are 
typically designed to exploit similarities between these two high performance realizations. 
These distributions can also be designed to exploit inferences about the local behavior of 
the objective function $f$ based on the two samples, $z'$ and $z''$, and their evaluations (Peck, 
1993). The dependence of the distributions $Q_k(z', z'', \cdot)$ on two samples combined with the 
use of selection¹ can eliminate the need for scheduling the narrowing of local search, which 
is required for most adaptive global random search methods (e.g., simulated annealing 
and the methods of generations (Zhigljavsky, 1991)). Since this is typically done in genetic 
algorithms, both the distributions $R_k$ and the distributions $Q_k$ are typically adapted on the 
basis of information obtained during the search. 

In Section 4, it is argued that genetic algorithm behavior can best be understood by 
understanding the sampling distributions induced on the phenospace. Accordingly, the 
sampling distributions imposed on $\mathbb{R}^n$ by the traditional recombination operators will now be 
considered with the use of a novel visualization technique. The operators that will be 
characterized are one-point crossover and uniform crossover. Other traditional recombination 
operators are visualized in (Peck, 1993). Due to the independence of the encoded parameters, 
it is sufficient to consider the sampling of one dimension at a time, $\mathbb{R}^1$. However, due 
to the dualism between encodings and recombination operators (Battle & Vose, 1991; Vose 
& Liepins, 1991b), visualizations will be presented of the recombination operators applied 
to both natural code and the gray code used in Genesis Version 5.0 (Grefenstette, 1990). 
Finally, as is typically the case, the real values will actually be encoded as integers and used 
as real values by applying an affine transformation. 

¹ Recall that selection, or the sampling of the distributions $R_k$, concentrates the sampling distribution in 
the high performance regions observed globally. 

The objective of this visualization technique is to communicate where the realizations of 
the recombination operators, $Q_k(z', z'', \cdot)$, are likely to be obtained relative to the location of 
the parents, $z'$ and $z''$. To fulfill this objective, all integers are encoded using six bits, and it is 
assumed that all pairs of parents are equally likely. For a particular pair of parent values, it is 
possible to compute the likelihood of realizing particular values given the recombination 
operator and the encoding scheme. A suitable visualization can be constructed by accumulating 
the marginal sampling distributions for sets of parent values separated by a given distance. 
To properly accumulate these distributions, they are translated by the amount required to 
position the mean of the two parents on the center column of the image². Each marginal 
distribution is then used to construct a single row of the visualization, where the brightest 
pixel values correspond to the most likely realizations. The top row of the resulting image 
corresponds to the marginal sampling distribution of parents separated by a distance of zero 
(they are the same). Successive rows correspond to the marginal distributions of increasingly 
separated parents. Finally, the bottom row corresponds to the marginal distribution 
of parents separated by a distance of 63. As shown in (Peck, 1993), it is also insightful to 
visualize the feasible realizations by setting all locations with a positive probability of being 
realized to white, and all other locations to black. 

² The image requires a minimum of 127 columns because when both parents are 0, the marginal sampling 
distribution occupies columns 63-126, and when they are both 63, the marginal sampling distribution 
occupies columns 0-63. For all other combinations of parents, the marginal distributions fall into this range of 
columns. 
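
The construction described above can be sketched as follows. This is an illustrative reading of the procedure, not the code used to produce the figures: the standard reflected binary Gray code stands in for the gray code of Genesis, the mean of the parents is rounded down before translation, all cut points and masks are taken as equally likely, and every function name is hypothetical.

```python
BITS = 6
SIZE = 1 << BITS        # 64 representable integers, 0..63
WIDTH = 2 * SIZE - 1    # 127 columns; the centre column is 63

def to_gray(n):
    """Reflected binary Gray code (ASSUMPTION: stands in for the Genesis code)."""
    return n ^ (n >> 1)

def from_gray(g):
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def one_point_children(a, b):
    """Both offspring for every interior cut point of two BITS-bit strings."""
    kids = []
    for cut in range(1, BITS):
        low = (1 << cut) - 1                 # mask selecting the low `cut` bits
        kids.append((a & ~low) | (b & low))
        kids.append((b & ~low) | (a & low))
    return kids

def uniform_children(a, b):
    """One offspring for every bit mask; each bit comes from either parent."""
    full = SIZE - 1
    return [(a & m) | (b & ~m & full) for m in range(SIZE)]

def crossover_image(children, encode=lambda v: v, decode=lambda v: v):
    """Row d holds the accumulated offspring distribution of parents d apart."""
    rows = [[0.0] * WIDTH for _ in range(SIZE)]
    for a in range(SIZE):
        for b in range(SIZE):
            row = rows[abs(a - b)]
            # ASSUMPTION: the parents' mean is rounded down before it is
            # moved onto the centre column.
            shift = (SIZE - 1) - (a + b) // 2
            kids = children(encode(a), encode(b))
            for kid in kids:
                row[decode(kid) + shift] += 1.0 / len(kids)
    for row in rows:                         # normalise each row to sum to one
        total = sum(row)
        for i in range(WIDTH):
            row[i] /= total
    return rows

if __name__ == "__main__":
    natural = crossover_image(one_point_children)
    gray = crossover_image(one_point_children, to_gray, from_gray)
    print(max(natural[63]), max(gray[63]))
```

Each row can be rendered as one line of a grey-scale image, with the largest value in the row mapped to the brightest pixel.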

Figure 5 shows the sampling distribution resulting from the application of one-point 
and uniform crossover to integers encoded with 6-bit natural code. Figure 6 presents the 
visualizations resulting from the use of 6-bit gray code. These visualizations indicate that 
the distributions generated by one-point crossover are more concentrated in the vicinity of 
the parents than those resulting from uniform crossover. The salient characteristic of the 
sampling distributions resulting from the use of the gray code representation is that the 
breadth of search decreases as the distance between the parents decreases. 

In (Peck, 1993), one-point, two-point, uniform, and parameterized uniform crossover 
operators using both natural and gray encodings are applied to De Jong’s test suite (De Jong, 
1975), and their effectiveness is compared on the basis of five performance measures. It is 
found that those operators that tend to sample most often near the parents result in superior 
performance. Therefore, it may be concluded that concentrating and constraining search 
in the vicinity of the parents results in superior performance. This conclusion is further 
bolstered by the recommended settings of the recombination control parameters, such as 
crossover and mutation probabilities, which serve to further localize search. Finally, this 
conclusion has been favorably exploited in the design of a family of recombination operators 
for use when $X \subset \mathbb{R}^n$ (Peck, 1993). An example of these operators and its visualization are 
presented in (52) and Figure 9, respectively. 

5.4 Management of the Population 

The population is the basis for the construction of the sampling distributions. The 
information obtained by the genetic algorithm up to a certain iteration is entirely contained in 
the distribution of the population’s samples and in the evaluations of the objective function 
obtained at those samples. In fact, this information completely determines the distributions 
$R_k$. For this reason, it is arguable that the management of the population should have been 
discussed in Subsection 5.2. However, for the sake of clarity, the many issues associated with 
the management of the population are considered here separately. The issues considered are 
those associated with the composition and creation of the population, the updating of the 
population, and the deletion of members from the population. 

Figure 5: Sampling distributions of one-point and uniform crossover search in the real domain 
with natural code representations: top) one-point crossover, bottom) uniform crossover. 

Figure 6: Sampling distributions of one-point and uniform crossover search in the real domain 
with gray code representations: top) one-point crossover, bottom) uniform crossover. 

5.4.1 Population Issues 

Of the two population issues considered in this subsection, population sizing and 
initialization, population sizing is certainly the more thoroughly investigated in the literature. The 
population provides an estimate of the objective function behavior. Obviously, a larger popu- 
lation results in a more dense sampling of the objective function and a better estimate. If the 
objective is to ensure with a certain degree of confidence that the algorithm will adequately 
search the objective function, then the complexity of the phenospace and the characteristics 
of the objective function should be considered in the sizing of the population. If the function 
varies significantly in small regions, then a larger population will be necessary to provide an 
effective estimate, whereas a slowly varying function may be adequately estimated with very 
few samples. Similarly, a highly complex phenospace will require more samples than a very 
simple one. The drawback to the use of larger populations is that the rate of improvement 
or convergence is slower when measured by the number of evaluations performed. 

The population sizing problem has been considered in the literature both empirically (De Jong, 
1975; Grefenstette, 1986; Schaffer, Caruana, Eshelman & Das, 1989; Jog, Suh & Gucht, 1989) 
and analytically (Goldberg, 1989b; Reeves, 1993; Goldberg & Rudnick, 1988; Goldberg, Deb 
& Clark, 1992; Goldberg, Deb & Clark, 1993). The empirical studies have suggested 
populations ranging from 20-200, depending on the optimality criterion. Of the analytical 
approaches, information about the objective function is considered only in (Goldberg & 
Rudnick, 1988; Goldberg, Deb & Clark, 1992; Goldberg, Deb & Clark, 1993), albeit in the 
form of collateral noise. The favorable empirical results obtained with these methods might 
be explainable in terms of the objective function, the properties of the phenospace, and the 
relationship between the schemata and the phenospace. If so, they may provide the basis for 
population sizing methods that are based more directly on the first two properties. Such a 
method would also be applicable when binary encodings of the candidate solutions are not 
used. 

A population management issue that has received little attention in the literature is 
improving population initialization. This literature is reviewed in (Peck, 1993), and a novel 
initialization technique based on stratified sampling is proposed. This method is motivated 
by the facts that reducing randomness can increase efficiency, and stratified sampling has 
been shown to dominate independent sampling (Zhigljavsky, 1991, §4.4). Stratified sampling 
involves dividing the sampling region, $X$, into $m$ subregions of equal volume. Then, if 
$N = m\ell$ samples are desired, each of the $m$ subregions is randomly sampled $\ell$ times, using 
a uniform distribution. The effects of stratified initialization on genetic algorithm behavior, 
however, are negligible when applied to De Jong’s test suite using an initial population 
of 50 samples. This suggests that genetic algorithm behavior is robust with respect to 
slight variations of the initial population, which is desirable. Problems for which $X$ or $f$ 
is highly complex, or only a small initial population is possible, may benefit from stratified 
initialization. 
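
A minimal sketch of stratified initialization for a one-dimensional interval is given below. It is only an illustration of the sampling scheme described above; the function name and the restriction to a single dimension are assumptions of the example, and the technique proposed in (Peck, 1993) may differ in detail.

```python
import random

def stratified_init(lo, hi, m, ell, rng=None):
    """Draw ell uniform samples from each of m equal-width strata of [lo, hi).

    Returns N = m * ell samples, ordered stratum by stratum.
    """
    rng = rng or random.Random()
    width = (hi - lo) / m
    return [lo + (s + rng.random()) * width
            for s in range(m)
            for _ in range(ell)]

if __name__ == "__main__":
    population = stratified_init(0.0, 1.0, m=10, ell=5, rng=random.Random(0))
    print(len(population), min(population), max(population))
```

With m = 1 the scheme reduces to independent uniform sampling, so the number of strata controls how much randomness is removed from the initial population.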

5.4.2 Sequentiality and Deletion 

Genetic algorithms adapt their sampling distributions based on information acquired during 
the search. Most commonly, the sampling distributions $\{P_{k+1}\}$ are sampled $N$ times before 
they are updated, where $N$ is the size of the population. In sequential or steady-state 
variants, the sampling distributions are updated more frequently, such as after each sample. 
This makes it possible to exploit information sooner after it is acquired. The portion of the 
population that is replaced prior to updating the sampling distributions is described by the 
generation gap. 

Increased sequentiality results in increased selection noise or variance compared to the 
use of generational replacement and the use of sampling variance reduction techniques, such 
as Baker’s “Stochastic Universal Sampling” (SUS) technique (Baker, 1987). Sampling variance 
reduction techniques work by establishing codependencies among the realizations of $R_k$. 
The more samples there are to be obtained from $R_k$, the more effective 
the sampling variance reduction technique will be. Selection variance is increased with the 
degree of sequentiality because fewer samples from $R_k$ are obtained at a time. Some of these 
assertions are supported in the literature. It has been concluded based on the use of uniform 
or random deletion that the potential advantages of overlapping populations are dominated 
by the negative effects of genetic drift or allele loss (De Jong, 1975; De Jong & Sarma, 1993). 
In (De Jong & Sarma, 1993), it is concluded that the higher variance associated with smaller 
generation gaps leads to greater variation of actual growth curves of individuals on a single 
genetic algorithm run, and more genetic drift or allele loss. 

Aside from the negative effects of increased selection noise, the performance of sequen- 
tial genetic algorithms is predominately determined by the deletion method. Consider the 
following strategies for removing samples from the current population to allow for the inser- 
tion of new samples. Best-in-first-out (BIFO) deletion, in which the best observed sample 
in the population is the first removed, would clearly result in a counterproductive influence 
on behavior. Conversely, worst-in-first-out (WIFO) deletion exploits observations very ag- 
gressively to concentrate samples in the highest performance regions encountered. Finally, 
last-in-first-out (LIFO) deletion would degenerate into a non-uniform random search with a 
very weak adaptive element, which is the last sample. Only WIFO deletion is in common 
use. 
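
The deletion strategies named above, together with the FIFO and uniform deletion discussed in the remainder of this subsection, differ only in which member is removed before a new sample is inserted. The sketch below makes that explicit; it assumes the population is stored oldest first and that larger fitness values are better, and the function names are hypothetical.

```python
import random

def deletion_index(fitness, strategy, rng=random):
    """Index of the member to delete; `fitness` is ordered oldest first."""
    n = len(fitness)
    if strategy == "BIFO":      # best-in-first-out
        return max(range(n), key=fitness.__getitem__)
    if strategy == "WIFO":      # worst-in-first-out
        return min(range(n), key=fitness.__getitem__)
    if strategy == "LIFO":      # the most recently inserted member
        return n - 1
    if strategy == "FIFO":      # the oldest member
        return 0
    if strategy == "uniform":   # any member, equally likely
        return rng.randrange(n)
    raise ValueError("unknown deletion strategy: " + strategy)

def steady_state_insert(population, fitness, child, child_fitness, strategy, rng=random):
    """Delete one member according to `strategy`, then append the new sample."""
    i = deletion_index(fitness, strategy, rng)
    del population[i]
    del fitness[i]
    population.append(child)
    fitness.append(child_fitness)

if __name__ == "__main__":
    members, scores = list("abcde"), [3.0, 1.0, 4.0, 1.0, 5.0]
    steady_state_insert(members, scores, "f", 2.0, "WIFO")
    print(members, scores)
```

Only BIFO and WIFO consult the fitness values, which is the distinction drawn later between deletion methods that use observed evaluations and those that do not.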

In (De Jong & Sarma, 1993), the effects of the generation gap on performance are investi- 
gated. It is concluded that the growth curves of genetic algorithm selection are independent 
of the generation gap, and there is no compounding effect (De Jong & Sarma, 1993). These 
conclusions are based on the use of uniform deletion, the comparison of the ideal growth 
curves for generational genetic algorithms and steady-state genetic algorithms with uniform 
deletion, which are presented in (Syswerda, 1991), and on mathematical analysis. Uni- 
form deletion, however, is not an aggressive deletion method. Furthermore, it has been 
shown that steady-state genetic algorithms with uniform deletion are not actually identical 
to generational genetic algorithms (Peck, 1993). Conversely, advantages can be accrued from 
sequentiality. These advantages, illustrated by the use of first-in-first-out (FIFO) deletion 
applied to a sequential genetic algorithm, may be seen by comparing Figures 2 and 7. 

Figure 7: The ideal behavior of steady-state proportional selection with FIFO deletion, 
applied to the clumped population distribution: left) the average proportion of individuals 
of different fitnesses in the population, right) the average population fitness. 

Many methods for deletion have been proposed for use in genetic algorithms (Syswerda, 
1991). These methods may be distinguished by whether the deletion strategy makes use of 
observed sample evaluations. Methods that do not use fitness evaluations, such as uniform 
and FIFO deletion, are preferred when the objective function is evaluated with noise since 
they will not result in a population biased by samples evaluated with favorable noise³. 
Conversely, those methods that use fitness information can exploit more aggressively, 
but they are not suitable for use in the presence of noise. To avoid premature convergence, 
however, care must be taken to ensure that Theorem 1 is not violated. 


³ The effects of noise on genetic algorithms are carefully examined in (Peck, 1993, §7.2). 


6 Convergence Properties 

In this section, the convergence properties of genetic algorithms will be considered. First, a 
property of genetic algorithms that makes global convergence proofs difficult, if not 
impossible, will be discussed. Subsequently, a simplistic remedy will be provided. This remedy 
will be accompanied by proofs of convergence to global optima. 

6.1 Why Genetic Algorithms may not Converge 

While genetic algorithms satisfy Zhigljavsky’s requirements on the global sampling compo- 
nents, they do not satisfy the requirements on the local sampling components. As discussed 
previously, the sampling distributions of the recombination operators are constrained locally 
by the similarities of the two parent samples. However, the parents are chosen by a global 
sampling component. Therefore, the two parents may not be very similar. As a result, the 
recombination sampling distributions may not be adequately constrained or localized for 
convergence. 

The dependence of the local sampling distributions on two samples can have undesirable 
consequences, such as convergence to sub-optima and divergent behavior. To illustrate these 
effects, consider the following function with the feasible space $X = \{x : x \in [0, 1)\}$: 

$$F6_\alpha(x) = -(x^8 + x - 1)^4 + (x^8 + x - 1)^2 + \alpha x. \tag{51}$$

This function is illustrated in Figure 8 for values of $\alpha$ equal to 0.22 and 0.23, respectively. 
This function has an optimum at approximately 0.96 with a narrow peak and a sub-optimal 
local maximum at approximately 0.35 with a broad peak. This function was designed such 
that a recombination event between samples from each peak will result in a disproportionate 
number of realizations in the larger, sub-optimal peak, and a recombination event between 
samples from the same peak will likely result in realizations within the same peak. 
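
The shape of this function can be checked numerically. The sketch below assumes that (51) reads $F6_\alpha(x) = -(x^8 + x - 1)^4 + (x^8 + x - 1)^2 + \alpha x$; the grid search and the split of the interval at 0.6 are conveniences of the example, not part of the original study.

```python
def f6(x, alpha):
    """F6 under the assumed reading of (51)."""
    u = x ** 8 + x - 1.0
    return -u ** 4 + u ** 2 + alpha * x

def argmax_on_grid(alpha, lo, hi, steps=10000):
    """Grid point in [lo, hi) with the largest F6 value."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps)]
    return max(xs, key=lambda x: f6(x, alpha))

if __name__ == "__main__":
    for alpha in (0.22, 0.23):
        broad = argmax_on_grid(alpha, 0.0, 0.6)    # sub-optimal, broad peak
        narrow = argmax_on_grid(alpha, 0.6, 1.0)   # optimal, narrow peak
        print(alpha, broad, f6(broad, alpha), narrow, f6(narrow, alpha))
```

Under this reading the broad peak lies near 0.35 and the narrow, higher peak near 0.96, in agreement with the description above.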

If the breadth of the sampling distributions $Q_k$ is dependent on the distance between 
the parents, then it is expected that a sampling distribution tug-of-war will ensue between 
the large, sub-optimal mass and the smaller, higher performance mass. Selection will always 
favor the samples within the optimal peak. Thus, if recombination always resulted in a 
realization occurring on the peak of the parent sample around which $Q_k$ is centered, then 
selection would concentrate the population on the optimal peak. In this manner, samples 
may be stolen by the optimal peak from the sub-optimal peak. However, samples within the 
sub-optimal peak will also be selected with positive probability. Due to the nature of F6, 
realizations of $Q_k$ centered at a sample within the optimal peak will often be obtained on 
the sub-optimal peak when the other parent sample is from the sub-optimal peak. If such 
a realization is then recombined with another sample from the sub-optimal peak, then the 
resulting sample will likely also be on the sub-optimal peak. In this manner, samples may be 
stolen from the optimal peak by the sub-optimal peak. Loosely speaking, if the rate at which 
samples are stolen by one peak is exactly balanced by the rate at which they are stolen by the 
other, then a steady state distribution or eigen-measure will occur. This situation would be 
unstable since a perturbation in the distribution will favor one peak or the other, which would 
be further reinforced by selection. 

Figure 8: An illustration of F6 in the feasible space $X = \{x : x \in [0, 1)\}$ for $\alpha = 0.22, 0.23$. 

To test the behavior of the genetic algorithm on this function, one of the three basic 
recombination operators proposed in (Peck, 1993) was used. The recombination operator is 
applied to each dimension independently. The basic form of its density is 



where $\varphi(x)$ is an arbitrary symmetric density centered at zero, $u = \kappa|z' - z''|$, and $\kappa$ is 
a control parameter. Densities of this form are constructed directly from the candidate 
solutions, are centered around each parent, and the search breadth is proportional to the 
distance between the parents. The concentration of the density around the parents can be 
controlled by varying $\kappa$. In (Peck, 1993), $\varphi(x)$ is set to the Gaussian density, the triangular 
or roof density, and the uniform density. In this case, however, $\varphi(x) = t(x)$, where $t(x)$ is 
the triangular density with zero mean and a base width of $\kappa = 1.0$. A realization, $r$, of $t(x)$ 
may be obtained from a realization, $\xi$, of a uniform deviate on the range [0, 1) according to 

$$r(\xi) = \begin{cases} \tfrac{1}{2}\left(-1 + \sqrt{2\xi}\right) & \text{if } \xi < 0.5, \\ \tfrac{1}{2}\left(1 - \sqrt{2 - 2\xi}\right) & \text{if } \xi \ge 0.5. \end{cases}$$

The visualization of the resulting sampling distribution is provided in Figure 9. 

Figure 9: The sampling distribution of the triangular recombination operator with a base of 
width 1.0. 
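
The inverse-transform rule for the triangular deviate can be written directly in code. The first function below follows that rule for a triangular density with zero mean and base width 1. The second function is only a hypothetical reading of how such a deviate might drive recombination: it assumes the offspring is centered on one parent chosen with equal probability and that the offset is scaled by $\kappa|z' - z''|$. The exact operator is the one defined by (52) in (Peck, 1993).

```python
import math
import random

def triangular_deviate(xi):
    """Map a uniform deviate xi in [0, 1) to a triangular deviate in [-0.5, 0.5)."""
    if xi < 0.5:
        return 0.5 * (-1.0 + math.sqrt(2.0 * xi))
    return 0.5 * (1.0 - math.sqrt(2.0 - 2.0 * xi))

def recombine(z1, z2, kappa=1.0, rng=random):
    """Hypothetical one-dimensional recombination driven by the triangular deviate.

    ASSUMPTION: the centre is either parent with equal probability, and the
    search breadth is kappa * |z1 - z2|.
    """
    centre = z1 if rng.random() < 0.5 else z2
    breadth = kappa * abs(z1 - z2)
    return centre + breadth * triangular_deviate(rng.random())

if __name__ == "__main__":
    rng = random.Random(0)
    print([round(recombine(0.2, 0.6, rng=rng), 3) for _ in range(5)])
```

Identical parents give a breadth of zero, so the offspring coincides with the parents, which is the sense in which the search narrows as the population converges.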

To avoid premature convergence due to inadequate sampling and to reduce the stochastic 
effects, a population of 10,000 samples was used. This population was initialized by sampling 
a uniform distribution on the unit interval. Figure 10 shows the progression of sampling 
distributions for $\alpha = 0.22$ and $\alpha = 0.23$. It was found that for values of $\alpha \le 0.22$ the 
sampling distributions will converge to the sub-optimal peak. It was also found that the 
sampling distributions will converge to the optimal peak when $\alpha \ge 0.23$. Figure 8 reveals 
that a small perturbation of $\alpha$ has a very small effect on F6, but Figure 10 clearly indicates 
that the effect on the sampling distribution sequence is dramatic. These results confirm 
the unstable, tug-of-war behavior of genetic algorithms on this function. More importantly, 
however, these results confirm that genetic algorithms can be expected to converge to 
sub-optima when applied to certain functions, even when the sampling of the objective function 
is adequate. Similar divergent behavior of canonical genetic algorithms has been observed 
on deceptive functions (Goldberg, 1987). 

Figure 10: Sampling distributions generated by F6: left) when $\alpha = 0.22$, convergence is to 
the sub-optimal peak; right) when $\alpha = 0.23$, convergence is to the optimal peak. 

6.2 Critical Requirements 

For Theorem 2 and its associated corollaries to be applicable, genetic algorithms must be 
representable in a form consistent with generational methods. This can be achieved by 
setting 

$$Q_k(z', dx) = \int_X p_k(dz'')\, Q_k(z', z'', dx),$$

where $p_k$ is described by (17). Thus, the genetic algorithm sampling distributions $\{P_{k+1}\}$ 
may be expressed according to (16). 

If assumption (p) of Section 3.3 were replaced with 

p'. the transition probabilities $Q_k(x', x'', \cdot)$ are defined by 

$$Q_k(x', x'', A) = \int 1_{[z \in A,\, f_k(x') \le f_k(z)]}\, T_k(x', x'', dz) + 1_A(x') \int 1_{[f_k(z) < f_k(x')]}\, T_k(x', x'', dz), \tag{53}$$

where $T_k(x', x'', dz)$ are transition probabilities, 

it would only be necessary to prove that the transition probabilities, $T_k(x', x'', dz)$, weakly 
converge to $\varepsilon_{x'}(dz)$ for $k \to \infty$ and for all $x' \in X$ to satisfy the requirements of Corollary 3. 
To prove this, however, would require additional assumptions on the objective function $f$. 
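
Read operationally, a transition of the form (53) draws a candidate from $T_k(x', x'', \cdot)$ and keeps it only if it is no worse than $x'$; otherwise the chain stays at $x'$. The sketch below illustrates that reading under the assumption that evaluations are noise-free. The `propose` callable is a hypothetical stand-in for sampling from $T_k$.

```python
import random

def elitist_transition(x1, x2, f, propose, rng=random):
    """One draw from a transition of the form (53).

    `propose(x1, x2, rng)` is a hypothetical sampler for T_k(x1, x2, .).
    The candidate replaces x1 only when f(x1) <= f(candidate).
    """
    z = propose(x1, x2, rng)
    return z if f(x1) <= f(z) else x1

if __name__ == "__main__":
    objective = lambda x: -(x - 1.0) ** 2
    blend = lambda a, b, rng: a + (b - a) * rng.random()
    rng = random.Random(0)
    x = 0.0
    for _ in range(20):
        x = elitist_transition(x, 2.0 * rng.random(), objective, blend, rng)
    print(x, objective(x))
```

The accepted sequence is monotone in the objective value, which is why only the weak convergence of $T_k$ remains to be shown.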

To meet the requirements of Corollary 4, satisfaction of the following assumption would 
be sufficient. 

r'. the transition probabilities $Q_k(x', x'', dz)$ are defined by 

$$Q_k(x', x'', dz) = c_k(x')\, \varphi\big((z - x')/\beta_k\big)\, \mu_n(dz), \tag{54}$$

where $\varphi$ is a continuous symmetrical finite density in $\mathbb{R}^n$, 

$$\beta_k > 0, \qquad \sum_{k=1}^{\infty} \beta_k < \infty, \qquad c_k(x) = \frac{1}{\int_X \varphi\big((z - x)/\beta_k\big)\, \mu_n(dz)}.$$

The novel recombination operator described by (52) may be expressed in the form of (54) 
with $\beta_k = \kappa|x' - x''|$. To verify the satisfaction of this assumption, it must be proved that 

$$\sum_{k=1}^{\infty} \beta_k < \infty.$$

The reason why this is not generally possible is discussed in Subsection 6.1. 

6.3 Ensuring Convergence to a Global Optimum 

In the previous subsection, the missing links in applying Zhigljavsky’s convergence proofs to 
genetic algorithms were revealed. In both cases, the critical requirement is proving that the 
distributions Q k weakly converge sufficiently quickly to a probability measure concentrated 
at a point. 

Rather than proving this property, it is possible to simply redesign the sampling 
distributions $Q_k$ to ensure this property is satisfied. Consider the following assumption: 

r''. the transition probabilities $Q_k(x', x'', dz)$ are defined by 

$$Q_k(x', x'', dz) = c_k(x')\, \varphi\big((z - x')/\beta_k\big)\, \mu_n(dz), \tag{55}$$

where $\varphi$ is a continuous symmetrical finite density in $\mathbb{R}^n$, 

$$\beta_k = \min\{\kappa|x' - x''|,\ \gamma_k\}, \tag{56}$$

$$x' \ne x'', \qquad c_k(x) = \frac{1}{\int_X \varphi\big((z - x)/\beta_k\big)\, \mu_n(dz)},$$

and 

$$\gamma_k > 0, \qquad \sum_{k=1}^{\infty} \gamma_k < \infty.$$

Selecting $\beta_k$ as in (56) allows the continued exploitation of similarities for adaptation and 
improved efficiency, and it forces the reduction of local search breadth at a sufficient rate to 
prevent diffusion of the sampling distribution away from global optima. To allow for nearly 
normal genetic algorithm performance, a conservative $\gamma_k$ schedule, which satisfies (r''), could 
be used. 
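
The capped breadth of (56) is straightforward to compute once a schedule is chosen. The sketch below uses $\gamma_k = \gamma_0 / k^2$ purely as an example of a positive, summable schedule; the paper does not prescribe a particular schedule, and the names and default values are assumptions of the example.

```python
import math

# For gamma_k = gamma0 / k**2 the series sums to gamma0 * pi**2 / 6.
GAMMA_SUM_BOUND = math.pi ** 2 / 6

def gamma_schedule(k, gamma0=1.0):
    """ASSUMPTION: an example schedule, positive with a finite sum."""
    return gamma0 / (k * k)

def flsr_breadth(x1, x2, k, kappa=1.0, gamma0=1.0):
    """Search breadth beta_k = min(kappa * |x1 - x2|, gamma_k), as in (56)."""
    return min(kappa * abs(x1 - x2), gamma_schedule(k, gamma0))

if __name__ == "__main__":
    for k in (1, 2, 5, 10, 100):
        print(k, flsr_breadth(0.0, 0.5, k))
```

A larger `gamma0` keeps the cap inactive for more iterations, which corresponds to the conservative schedule suggested above: early behavior is governed by the distance between the parents, and the cap takes over only late in the search.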

The assumptions in Appendix B, the assumption that the feasible space, $X$, is a 
compact metric space of arbitrary type, and assumptions (p') and (r'') above permit the 
following corollaries. 

Corollary 6 Let the conditions (c), (d), (e), (h), (i), (j), (o), (q), (t), and (p') be satisfied. 
Furthermore, let (r'') be satisfied for the transition probabilities $T_k(x', x'', dz)$ of (53). Then 
the sequence of distributions determined by (21) weakly converges to $\varepsilon^*(dx)$ for $k \to \infty$. 

Proof: All of the conditions of Corollary 3 are satisfied. ■ 

Corollary 7 Let the conditions (e), (h), (i), (j), (q), (t), and (r'') be satisfied. Then the 
sequence of distributions determined by (21) weakly converges to $\varepsilon^*(dx)$ for $k \to \infty$. 

Proof: All of the conditions of Corollary 4 are satisfied. ■ 

Corollaries 6 and 7 demonstrate that genetic algorithms can be constructed in a manner 
to ensure convergence to a global optimum. 

Interestingly, even when very small values of $\alpha$ were used in (51), a genetic algorithm 
using forced local search reduction (FLSR) applied to the distribution in (52) consistently 
converged to the global optimum. FLSR has also been applied to other novel recombination 
operators and shown to be highly effective when optimizing the functions in De Jong’s test 
suite (Peck, 1993). 

7 Conclusions 

In this paper, the theory of global random search methods is applied to genetic algorithms, 
and genetic algorithms are generalized into a broader class of methods. This broader class 
includes those global random search methods with probability transition operators that are 
dependent on two globally obtained samples. 

A primary tenet of this paper is that the construction and evolution of the sampling 
distributions $\{P_{k+1}\}$, particularly in the context of the phenospace, is the preferred basis 
for understanding genetic algorithm behavior. It is the preferred basis because it operates 
at the level of abstraction most appropriate for understanding the interplay among the 
search of the objective function, the procedural elements, and generating mechanisms of the 
genetic algorithm. Accordingly, the genetic algorithm is reformulated in terms of sampling 
distributions and generalized in terms of the phenospace. Three heuristics to aid in the 
understanding of genetic algorithm design and behavior are also introduced. 

The factors affecting these sampling distributions are considered extensively. It is con- 
cluded that: there are many advantages to exploiting candidate solution similarities directly, 
selection variance can be expected to degrade performance, the best traditional recombina- 
tion operators have localized search distributions that are increasingly constrained in breadth 
as the distance between the parents decreases, genetic algorithms are robust with respect to 
initial populations, and FIFO deletion is more exploitative than generational replacement. 

Sufficient conditions for convergence to a global optimum are also established. These 
conditions ensure that the transition probabilities, which are otherwise constrained primarily 
by the similarities of two globally obtained and possibly dissimilar samples, are adequately 
localized. These sufficient conditions for convergence, however, are purchased at the cost of 
one of the most appealing characteristics of genetic algorithms: their totally adaptive nature. 
To theoretically ensure weak convergence to a global optimum, a schedule for constraining 
the search breadth of the recombination operator must be supplied. 

There are many opportunities for further research related to this paper: deriving the 
relationship between high sampling variance and poorer selection performance, reducing se- 
lection sampling variance in sequential or steady-state methods, reexamining the population 
sizing problem to make the dependencies on the complexity of $X$ and $f$ explicit, weakening 
the sufficient conditions for the weak convergence of genetic algorithms to a global optimum, 
and developing a fully adaptive method that is provably convergent, but does not depend 
on scheduled control of the transition probabilities. 

A Weak Convergence 

In this appendix, weak convergence is defined. The presentation is adapted from (Billingsley, 
1971). 

Let $X$ be a separable and complete metric space. Denote the interior, closure, and 
boundary of a set $S$ as $S^\circ$, $S^-$, and $\partial S$, respectively, where $\partial S$ is $S^- - S^\circ$. Denote the class 
of bounded, continuous real-valued functions on $X$ as $C(X)$. Let the $\sigma$-algebra generated by 
the open sets in $X$ be denoted $\mathcal{B}$, and note that all functions in $C(X)$ are measurable with 
respect to $\mathcal{B}$. 

Weak convergence is concerned with the nonnegative, completely additive set functions 
$P$ on $\mathcal{B}$ for which $P(X) = 1$ (probability measures). A set $S$ whose boundary satisfies 
$P(\partial S) = 0$ is referred to as a $P$-continuity set. If $P_k$ and $P$ are probability measures on 
$(X, \mathcal{B})$, then $P_k$ converges weakly to $P$, denoted $P_k \Rightarrow P$, if 

$$\lim_{k \to \infty} \int_X f\, dP_k = \int_X f\, dP \tag{57}$$

for all functions $f$ in $C(X)$ (Billingsley, 1971). The convergence of integrals of functions forms 
the basis of this definition of weak convergence. Weak convergence may also be characterized 
in terms of the convergence of the measures of sets. 



Theorem 4 These conditions are equivalent: 

a. $P_k \Rightarrow P$, 

b. $\limsup_k P_k(F) \le P(F)$ for all closed $F$, 

c. $\liminf_k P_k(G) \ge P(G)$ for all open $G$, 

d. $\lim_k P_k(S) = P(S)$ for all $P$-continuity sets $S$. 

Proof. A proof is provided in (Billingsley, 1971, Thm. 2.1). 
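
Definition (57) can be illustrated numerically with a simple one-dimensional case. In the sketch below, $P_k$ is taken to be the uniform distribution on $[x - 1/k,\ x + 1/k]$, which converges weakly to the measure concentrated at $x$; the choice of $P_k$, the test function, and the midpoint-rule quadrature are all assumptions of the example.

```python
import math

def integral_uniform(f, centre, half_width, n=2000):
    """Midpoint-rule estimate of the integral of f with respect to the
    uniform distribution on [centre - half_width, centre + half_width]."""
    h = 2.0 * half_width / n
    left = centre - half_width
    return sum(f(left + (i + 0.5) * h) for i in range(n)) / n

if __name__ == "__main__":
    x = 0.3
    for k in (1, 10, 100, 1000):
        gap = abs(integral_uniform(math.cos, x, 1.0 / k) - math.cos(x))
        print(k, gap)
```

The gap between the integral against $P_k$ and the value of the function at $x$ shrinks as $k$ grows, as (57) requires for a bounded continuous function. The closed set $\{x\}$ has $P_k$-measure zero for every $k$ but limit measure one, which shows that the inequality in condition (b) can be strict.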

B Assumptions 

The following list comprises the assumptions used in this paper. These assumptions and the 
following commentary are adapted from (Zhigljavsky, 1991, §5.2.1). 

a. $\xi_k(x)$ for any $x \in X$ and $k = 1, 2, \ldots$ are random variables having a zero-mean 
distribution $F_k(x, d\xi)$ concentrated on a finite interval $[-d, d]$; and the 
random variables $\xi_{k_1}(x_1), \xi_{k_2}(x_2), \ldots$ are mutually independent for any $k_1, k_2, \ldots$ 
and $x_1, x_2, \ldots$ from $X$; 

b. $y_k(x) = f_k(x) + \xi_k(x) \ge c_1 > 0$ with probability one for all $x \in X$, $k = 1, 2, \ldots$; 

c. $0 < c_1 \le f_k(x) \le M_k = \sup f_k(x) \le C < \infty$ for all $x \in X$, $k = 1, 2, \ldots$; 

d. the sequence of functions $f_k(x)$ converges to $f(x)$ for $k \to \infty$ uniformly in $x \in X$; 

e. $Q_k(z, dx) = q_k(z, x)\, \mu(dx)$, 

$$\sup q_k(z, x) \le L_k < \infty$$

for all $k = 1, 2, \ldots$, where $\mu$ is a probability measure on $(X, \mathcal{B})$; 

f. the random elements $x_1, \ldots, x_N$ with a distribution $R(dx_1, \ldots, dx_N)$ defined on 
$\mathcal{B}_N = \sigma(X \times \cdots \times X)$ are symmetrically dependent⁴. That is, for any choice of 
distinct positive integers $i_1, \ldots, i_N$, the joint distribution of 

$$x_{i_1}, \ldots, x_{i_N}$$

depends only on $N$ and is independent of the integers $i_1, \ldots, i_N$ (Blum, Chernoff, 
Rosenblatt & Teicher, 1959); 

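g. the probability distribution $P_M(dx_1, \ldots, dx_M)$ on $\mathcal{B}_M$ is described in terms of the 
distribution $R_N(dx_1, \ldots, dx_N)$ through 

$$P_M(dx_1, \ldots, dx_M) = \int_{Z^N} \Pi(d\Theta_N) \prod_{j=1}^{M} \Big[ \kappa(\Theta_N) \sum_{i=1}^{N} \Lambda(z_i, \xi_i, dx_j) \Big], \tag{58}$$

where 

$$\Theta_N = ((z_1, \xi_1), \ldots, (z_N, \xi_N)),$$

$$Z = X \times [-d, d],$$

$$\Pi(d\Theta_N) = R_N(dz_1, \ldots, dz_N)\, F(z_1, d\xi_1) \cdots F(z_N, d\xi_N),$$

$$\kappa(\Theta_N) = \frac{1}{\sum_{i=1}^{N} (f(z_i) + \xi_i)},$$

$$\Lambda(z, \xi, dx) = (f(z) + \xi)\, Q(z, dx);$$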

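h. the global maximizer $x^*$ of $f$ is unique, and there exists $\varepsilon > 0$ such that $f$ is 
continuous in the set $B(x^*, \varepsilon) = B(\varepsilon)$; 

i. $\mu$ is a probability measure on $(X, \mathcal{B})$ such that $\mu(B(\varepsilon)) > 0$ for any $\varepsilon > 0$; 

j. there exists $\varepsilon_0 > 0$ such that the sets $A(\varepsilon) = \{x \in X : f(x^*) - f(x) < \varepsilon\}$ are 
connected for any $\varepsilon$, $0 < \varepsilon < \varepsilon_0$; 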

⁴ Symmetrically dependent random variables are also called interchangeable (Blum, Chernoff, Rosenblatt 
& Teicher, 1959) and exchangeable (Loève, 1963). 
k. the sequence of probability measures $Q_k(x, dz)$ weakly converges to $\varepsilon_x(dz)$, for 
any $x \in X$ as $k \to \infty$, where $\varepsilon_x(dz)$ is the probability measure concentrated at 
the point $x$; 

l. the sequence of probability measures $R(k, N_k, x; dz)$ weakly converges to $\varepsilon_x(dz)$, 
for any $x \in X$ as $k \to \infty$; 

m. for any $\varepsilon > 0$ there are $\delta > 0$ and a natural $k_0$ such that $P_k(B(\varepsilon)) > \delta$ for all 
$k > k_0$; 

n. for any $\varepsilon > 0$ there are $\delta > 0$ and a natural $k_0$ such that $P(k, N_{k-1}; B(\varepsilon)) > \delta$ 
for all $k > k_0$; 

o. the functions f k , for k = 1,2, . . . are evaluated without random noise; 

p. the transition probabilities $Q_k(x, \cdot)$ are defined by 

$$Q_k(x, A) = \int 1_{[z \in A,\ f_k(x) \le f_k(z)]}\, T_k(x, dz) + 1_A(x) \int 1_{[f_k(z) < f_k(x)]}\, T_k(x, dz), \tag{59}$$

where $T_k(x, dz)$ are transition probabilities, weakly converging to $\varepsilon_x(dz)$ for $k \to 
\infty$ and for all $x \in X$; 

q. $P_1(B(x, \varepsilon)) > 0$ for all $\varepsilon > 0$, $x \in X$; 

r. the transition probabilities $Q_k(x, dz)$ are defined by 

$$Q_k(x, dz) = c_k(x)\, \varphi((z - x)/\beta_k)\, \mu_n(dz), \tag{60}$$

where $\varphi$ is a continuous symmetrical finite density in $\mathbb{R}^n$, 

$$\beta_k > 0, \qquad \beta_k < \infty, \qquad c_k(x) = \frac{1}{\int_X \varphi((z - x)/\beta_k)\, \mu_n(dz)};$$

s. $f_k(x) = f(x)$, $\xi_k(x) = \xi(x)$, $Q_k(x, dz) = Q(x, dz)$ for each $k = 1, 2, \ldots$; and 

t. $f_k(x) = f(x)$ for $k = 1, 2, \ldots$ 



A few of Zhigljavsky’s comments regarding these assumptions will now be related. 

Condition (a) makes two basic requirements on the evaluation noise: it must be inde- 
pendent, and it must be concentrated on a finite interval. The requirement of finiteness is 
particularly important. If the evaluation noise at a suboptimal point is positive and very 
large, then all subsequent evaluations will occur in its vicinity with large probability. This 
holds even if the search was already concentrated at the global maximizer. 
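A small worked example shows the effect. The numbers are chosen here purely for illustration and are not taken from Zhigljavsky. Assume that a point $z_i$ is selected with probability proportional to its noisy evaluation $f(z_i) + \xi_i$, and take three points with $f = (1, 2, 3)$, so that the third point is the best. The selection probabilities are then:

```latex
\begin{align*}
\text{no noise, } \xi = (0, 0, 0):&\quad
  \left(\tfrac{1}{6},\ \tfrac{2}{6},\ \tfrac{3}{6}\right)
  \approx (0.17,\ 0.33,\ 0.50), \\
\text{one large outlier, } \xi = (50, 0, 0):&\quad
  \left(\tfrac{51}{56},\ \tfrac{2}{56},\ \tfrac{3}{56}\right)
  \approx (0.91,\ 0.04,\ 0.05).
\end{align*}
```

With the outlier, roughly 91% of the selections fall on the worst of the three points. Bounding the noise by $d$ limits how far a single evaluation can distort these probabilities.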

The requirement of condition (b) may be easily satisfied by constructing from $f_k(x)$ an 
auxiliary function for which (b) holds. If an $a_k$ is known such that $P\{\sup |\xi_k(x)| < 
a_k\}$ is equal or almost equal to one, then an auxiliary function, based on $f_k(x)$, that can be made 
arbitrarily close to $\max\{0, f_k(x) + \text{constant}\}$ is presented in (Zhigljavsky, 1991). 

The conditions (h), (i), and (j) are natural and non-restrictive (Zhigljavsky, 1991). The 
uniqueness requirement of the global maximizer x* is imposed to simplify some formulations. 
Zhigljavsky notes that the results presented actually deal with distribution convergence to 
a distribution concentrated on the set 


$$\bigcap_{\varepsilon > 0} D\{\arg\max f(z)\} \tag{61}$$


instead of convergence to $\varepsilon_{x^*}(dx)$. Therefore, the uniqueness requirement can be relaxed, 
and convergence can be understood in this sense. Condition (j), when imposed, does require 
that the set (61) be connected. 

Necessary requirements on the parameters of Procedure 3 are formulated in conditions (e), 
(k), and (l). Distributions satisfying these requirements, however, are very easily constructed. 

The assumptions formulated in (f), (g), and (s) are not requirements. They are only 
auxiliary tools for formulating Lemma 1. In this formulation, $\Theta_N$ is an $N$-fold sampling of 
$X$ and the noise process (i.e., $\Theta_N \in Z^N$). The probability of sampling a subregion of $Z^N$ 
is described by the distribution $\Pi(d\Theta_N)$. The sampling distribution for a particular $dx$ is 
described by 

$$\kappa(\Theta_N) \sum_{i=1}^{N} \Lambda(z_i, \xi_i, dx) = \sum_{i=1}^{N} \frac{f(z_i) + \xi_i}{\sum_{l=1}^{N} (f(z_l) + \xi_l)}\, Q(z_i, dx),$$

which is analogous to (16) in Procedure 3. 
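The two-stage structure of this sampling distribution (choose a parent with probability proportional to its noisy evaluation, then move according to $Q$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Procedure 3: the function names, the interval $X = [0, 1]$, and the Gaussian local step are all hypothetical choices made for the example.

```python
import random

def selection_weights(f_values, noise):
    """Weights (f(z_i) + xi_i) / sum_l (f(z_l) + xi_l) over the N parents."""
    y = [f + xi for f, xi in zip(f_values, noise)]
    if min(y) <= 0.0:
        # Condition (b) asks for strictly positive noisy evaluations.
        raise ValueError("noisy evaluations must be positive")
    total = sum(y)
    return [v / total for v in y]

def sample_next_point(parents, f_values, noise, transition, rng):
    """Pick parent z_i with its selection weight, then draw from Q(z_i, .)."""
    weights = selection_weights(f_values, noise)
    (parent,) = rng.choices(parents, weights=weights, k=1)
    return transition(parent, rng)

def local_step(z, rng, scale=0.05):
    """Assumed transition Q(z, .): Gaussian step, redrawn until it lands
    in X = [0, 1]. This choice is for illustration only."""
    while True:
        x = z + rng.gauss(0.0, scale)
        if 0.0 <= x <= 1.0:
            return x

rng = random.Random(0)
parents = [0.1, 0.5, 0.9]      # z_1, z_2, z_3
f_values = [1.0, 3.0, 2.0]     # f(z_i)
noise = [0.2, -0.1, 0.0]       # xi_i, bounded as in condition (a)
weights = selection_weights(f_values, noise)
new_point = sample_next_point(parents, f_values, noise, local_step, rng)
```

Here the second parent receives the largest weight because its noisy evaluation is the largest; the weights always sum to one, which is the role of the normalizing factor on the left-hand side of the display above.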

Assumptions (m) and (n) may be regarded as conditions imposed on the parameters of 
Procedure 3. Since these conditions are not constructive, easily verifiable conditions sufficient 
for the validity of (m) or (n) are of interest (Zhigljavsky, 1991). The conditions (p), (q), and 
(r) represent such sufficient conditions for two widely used forms of transition probabilities. 
A realization $y_k$ from (59) may be obtained by sampling the distribution $T_k(x, \cdot)$ to get $\zeta_k$ 
and setting 

$$y_k = \begin{cases} \zeta_k & \text{if } f_k(\zeta_k) \ge f_k(x), \\ x & \text{otherwise.} \end{cases}$$
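The accept-if-not-worse rule for drawing a realization from (59) can be sketched in a few lines. This is an illustrative sketch only: the objective, the uniform proposal standing in for the transition probability $T_k$, and all function names are assumptions introduced here.

```python
import random

def elitist_step(x, f_k, propose, rng):
    """One realization from (59): draw a candidate from T_k(x, .) and keep
    it only if it is at least as good as x under the noise-free f_k."""
    candidate = propose(x, rng)
    return candidate if f_k(candidate) >= f_k(x) else x

def f(x):
    """Hypothetical noise-free objective with its maximum at x = 0.3."""
    return -(x - 0.3) ** 2

def propose(x, rng):
    """Hypothetical stand-in for T_k(x, .): a uniform step of width 0.2."""
    return x + rng.uniform(-0.1, 0.1)

rng = random.Random(1)
x = 0.9
trace = [x]
for _ in range(200):
    x = elitist_step(x, f, propose, rng)
    trace.append(x)
```

Because a candidate is accepted only when it does not decrease the objective, the sequence of objective values along `trace` never decreases. The same rule would be unreliable under noisy evaluations, where a single favorable noise draw could lock in a poor point.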


This form of transition probability is suitable only when the functions $f_k$ are evaluated without 
noise. When noise is present, (60) is a natural way of determining transition probabilities 
for $X \subset \mathbb{R}^n$. A random realization $y_k$ in $X$ from the distribution $Q_k(x, \cdot)$ in (60) may be 
obtained by repeatedly sampling to obtain a realization $\zeta_k$ until $x + \zeta_k \in X$, then setting 
$y_k = x + \zeta_k$. When $X \subset \mathbb{R}^n$, the transition probabilities $T_k(x, \cdot)$ of (59) may be chosen 
using (60). 
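The repeated-sampling recipe for (60) is a rejection scheme, sketched below under explicit assumptions: the offset is generated as the scale parameter times a draw from the density in (60), $X$ is taken to be the unit square, and a product of triangular densities on $[-1, 1]$ serves as one example of a continuous, symmetric density with bounded support. All names are hypothetical.

```python
import random

def sample_transition(x, beta_k, in_X, draw_phi, rng, max_tries=10_000):
    """Draw a point from Q_k(x, .) of (60) by rejection: propose
    x + beta_k * u with u drawn from phi, and return the first proposal
    that lies inside X."""
    for _ in range(max_tries):
        u = draw_phi(rng)
        y = [xc + beta_k * uc for xc, uc in zip(x, u)]
        if in_X(y):
            return y
    raise RuntimeError("no proposal landed in X; check beta_k and X")

def draw_triangular_2d(rng):
    """Assumed phi: independent triangular densities on [-1, 1], mode 0."""
    return [rng.triangular(-1.0, 1.0, 0.0) for _ in range(2)]

def in_unit_square(y):
    """Assumed search space X = [0, 1] x [0, 1]."""
    return all(0.0 <= c <= 1.0 for c in y)

rng = random.Random(2)
x = [0.05, 0.95]   # a point near a corner of X, so some proposals are rejected
beta_k = 0.2
samples = [sample_transition(x, beta_k, in_unit_square, draw_triangular_2d, rng)
           for _ in range(100)]
```

Discarding proposals that fall outside $X$ renormalizes the density over $X$, which is the job of the factor $c_k(x)$ in (60). The `max_tries` guard is a practical safeguard added for this sketch and is not part of the source description.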

Zhigljavsky finally observes that condition (q) places requirements on both $X$ and $P_1$. 
When $X \subset \mathbb{R}^n$ and $X$ is of non-zero Lebesgue measure, then (q) means that the $P_1$-measure 
of any non-empty ball in $\mathbb{R}^n$ with the center in $X$ is larger than zero and that $X$ has no 
appendices⁵. 